# Web scraping problem

### Web Scraping to Extract Tesla Revenue Data

In this project, we are going to scrape the Tesla revenue data, store it in a dataframe, and then save it in a SQLite database.

To know whether a website allows web scraping or not, you can look at the website’s “robots.txt” file. You can find this file by appending “/robots.txt” to the URL that you want to scrape.
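
The robots.txt check can also be done in code. The following is a minimal sketch using the standard library's `urllib.robotparser`; the robots.txt content and the example.com URLs are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules allow that URL
public_ok = parser.can_fetch("*", "https://example.com/stocks/charts")
private_ok = parser.can_fetch("*", "https://example.com/private/data")
print(public_ok, private_ok)
```

For a live site, `set_url()` followed by `read()` downloads the real file instead of parsing a string.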

**Step 1:** Make sure you have requests, bs4 (BeautifulSoup) and pandas installed. sqlite3 is part of the Python standard library, so it does not need to be installed. If any of the others are missing, you can use the following command in the terminal:

```sh
pip install pandas requests beautifulsoup4
```

**Step 2:** Import libraries

If you are using a notebook, run the cell. If you are running the code as a script from a text editor, wrap expressions in print() to see their output.

In [10]:
# import packages

import pandas as pd
import requests
from bs4 import BeautifulSoup
import sqlite3


**Step 3:** Use the requests library to download the webpage https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue. Save the text of the response as a variable named html_data.

In [29]:
 
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data = requests.get(url).text


**Step 4:** Parse the HTML data using BeautifulSoup

In [None]:
soup = BeautifulSoup(html_data,"html.parser")

**Step 5:** Using BeautifulSoup, extract the table with Tesla Quarterly Revenue and store it in a dataframe named Tesla_revenue. The dataframe should have the columns Date and Revenue. Make sure the commas and dollar signs are removed from the Revenue column.

In [None]:
tables = soup.find_all('table')
# Find the table whose markup mentions "Tesla Quarterly Revenue"
table_index = None
for index, table in enumerate(tables):
    if "Tesla Quarterly Revenue" in str(table):
        table_index = index
if table_index is None:
    raise ValueError("Tesla Quarterly Revenue table not found")
# Collect the rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:
        Date = col[0].text
        Revenue = col[1].text.replace("$", "").replace(",", "")
        rows.append({"Date": Date, "Revenue": Revenue})
Tesla_revenue = pd.DataFrame(rows, columns=["Date", "Revenue"])
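
After this cleaning the Revenue values are still strings. As a rough sketch of the same cleaning step applied to single values, with an optional conversion to integers, the helpers below may be useful; `clean_revenue` and `to_int_or_none` are hypothetical names, and the sample inputs only assume that the scraped cells look like "$18,756".

```python
def clean_revenue(raw: str) -> str:
    """Strip the dollar sign, thousands separators and surrounding whitespace."""
    return raw.replace("$", "").replace(",", "").strip()


def to_int_or_none(cleaned: str):
    """Convert a cleaned revenue string to int, or None when it is empty."""
    return int(cleaned) if cleaned else None


# Assumed cell formats, including an empty cell
samples = ["$18,756", "$957", ""]
cleaned = [clean_revenue(s) for s in samples]
numbers = [to_int_or_none(c) for c in cleaned]
print(cleaned)
print(numbers)
```

Returning None for empty cells keeps missing values distinguishable from a revenue of zero.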

**Step 6:** Remove the rows in the dataframe that are empty strings or are NaN in the Revenue column. Print the entire Tesla_revenue DataFrame to check whether any remain.

In [30]:
Tesla_revenue = Tesla_revenue.dropna(subset=["Revenue"])
Tesla_revenue = Tesla_revenue[Tesla_revenue['Revenue'] != ""]
Tesla_revenue

| | Date | Revenue |
|---|---|---|
| 0 | 2022-03-31 | 18756 |
| 1 | 2021-12-31 | 17719 |
| 2 | 2021-09-30 | 13757 |
| 3 | 2021-06-30 | 11958 |
| 4 | 2021-03-31 | 10389 |
| 5 | 2020-12-31 | 10744 |
| 6 | 2020-09-30 | 8771 |
| 7 | 2020-06-30 | 6036 |
| 8 | 2020-03-31 | 5985 |
| 9 | 2019-12-31 | 7384 |


**Step 7:** Make sure Tesla_revenue is still a dataframe

In [31]:
type(Tesla_revenue)

pandas.core.frame.DataFrame

**Step 8:** Convert the dataframe into a list of tuples so that the data can be inserted into sqlite3

In [32]:
# itertuples(index=False, name=None) yields one plain tuple per row
list_of_tuples = list(Tesla_revenue.itertuples(index=False, name=None))
list_of_tuples

[('2022-03-31', '18756'),
 ('2021-12-31', '17719'),
 ('2021-09-30', '13757'),
 ('2021-06-30', '11958'),
 ('2021-03-31', '10389'),
 ('2020-12-31', '10744'),
 ('2020-09-30', '8771'),
 ('2020-06-30', '6036'),
 ('2020-03-31', '5985'),
 ('2019-12-31', '7384'),
 ('2019-09-30', '6303'),
 ('2019-06-30', '6350'),
 ('2019-03-31', '4541'),
 ('2018-12-31', '7226'),
 ('2018-09-30', '6824'),
 ('2018-06-30', '4002'),
 ('2018-03-31', '3409'),
 ('2017-12-31', '3288'),
 ('2017-09-30', '2985'),
 ('2017-06-30', '2790'),
 ('2017-03-31', '2696'),
 ('2016-12-31', '2285'),
 ('2016-09-30', '2298'),
 ('2016-06-30', '1270'),
 ('2016-03-31', '1147'),
 ('2015-12-31', '1214'),
 ('2015-09-30', '937'),
 ('2015-06-30', '955'),
 ('2015-03-31', '940'),
 ('2014-12-31', '957'),
 ('2014-09-30', '852'),
 ('2014-06-30', '769'),
 ('2014-03-31', '621'),
 ('2013-12-31', '615'),
 ('2013-09-30', '431'),
 ('2013-06-30', '405'),
 ('2013-03-31', '562'),
 ('2012-12-31', '306'),
 ('2012-09-30', '50'),
 ('2012-06-30', '27'),
 ...]

**Step 9:** Now let's create our SQLite3 database. The following command connects to a SQLite3 database. If the database does not exist, it will be created.

In [33]:
# Use the connect() function of sqlite3 to create a database. It will create a connection object.

connection = sqlite3.connect('Tesla.db')

**Step 10:** Let's create a table in our database to store our revenue values.

In [35]:
c = connection.cursor()

# Create the table (IF NOT EXISTS keeps a rerun from failing)
c.execute('''CREATE TABLE IF NOT EXISTS revenue
             (Date, Revenue)''')

<sqlite3.Cursor at 0x7f6fd433e2d0>

In [36]:
# Insert the values
c.executemany('INSERT INTO revenue VALUES (?,?)', list_of_tuples)
# Save (commit) the changes
connection.commit()
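
As an alternative to calling commit() by hand, the connection can be used as a context manager. The sketch below assumes an in-memory database and two sample rows in the same (Date, Revenue) shape as list_of_tuples, so it does not touch Tesla.db.

```python
import sqlite3

# Sample rows in the same (Date, Revenue) shape as list_of_tuples
sample_rows = [("2022-03-31", "18756"), ("2021-12-31", "17719")]

# In-memory database: nothing is written to disk
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE revenue (Date, Revenue)")

# "with connection" commits on success and rolls back if an exception is raised
with connection:
    connection.executemany("INSERT INTO revenue VALUES (?, ?)", sample_rows)

row_count = connection.execute("SELECT COUNT(*) FROM revenue").fetchone()[0]
print(row_count)
connection.close()
```

The `?` placeholders let sqlite3 handle quoting of the values, which is safer than building the SQL string by hand.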

**Step 11:** Now retrieve the data from the database

In [37]:
for row in c.execute('SELECT * FROM revenue'):
    print(row)

('2022-03-31', '18756')
('2021-12-31', '17719')
('2021-09-30', '13757')
('2021-06-30', '11958')
('2021-03-31', '10389')
('2020-12-31', '10744')
('2020-09-30', '8771')
('2020-06-30', '6036')
('2020-03-31', '5985')
('2019-12-31', '7384')
('2019-09-30', '6303')
('2019-06-30', '6350')
('2019-03-31', '4541')
('2018-12-31', '7226')
('2018-09-30', '6824')
('2018-06-30', '4002')
('2018-03-31', '3409')
('2017-12-31', '3288')
('2017-09-30', '2985')
('2017-06-30', '2790')
('2017-03-31', '2696')
('2016-12-31', '2285')
('2016-09-30', '2298')
('2016-06-30', '1270')
('2016-03-31', '1147')
('2015-12-31', '1214')
('2015-09-30', '937')
('2015-06-30', '955')
('2015-03-31', '940')
('2014-12-31', '957')
('2014-09-30', '852')
('2014-06-30', '769')
('2014-03-31', '621')
('2013-12-31', '615')
('2013-09-30', '431')
('2013-06-30', '405')
('2013-03-31', '562')
('2012-12-31', '306')
('2012-09-30', '50')
('2012-06-30', '27')
('2012-03-31', '30')
('2011-12-31', '39')
('2011-09-30', '58')
('2011-06-30', '58')
...
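
The rows come back with Revenue as text, because that is how the values were inserted. The sketch below, which assumes an in-memory table holding three of the rows shown above, illustrates why this matters: comparisons on text are character by character, so a CAST is needed for numeric results.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE revenue (Date, Revenue)")
rows = [("2012-09-30", "50"), ("2015-09-30", "937"), ("2022-03-31", "18756")]
connection.executemany("INSERT INTO revenue VALUES (?, ?)", rows)

# Text comparison: "937" ranks above "18756" because '9' > '1'
max_as_text = connection.execute(
    "SELECT MAX(Revenue) FROM revenue"
).fetchone()[0]

# Numeric comparison after casting each value to an integer
max_as_number = connection.execute(
    "SELECT MAX(CAST(Revenue AS INTEGER)) FROM revenue"
).fetchone()[0]

print(max_as_text, max_as_number)
connection.close()
```

Converting Revenue to integers before inserting would avoid the need for the cast.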

Our database is named “Tesla.db”, and the connection to it is stored in the connection object.

The next time we run this file, sqlite3.connect() simply connects to the existing database; if the database is not there, it creates one.
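
To illustrate the rerun behaviour, here is a small sketch that opens the same database file twice. The `save_rows` helper and the Tesla_demo.db file name are hypothetical, and a temporary folder is used so that Tesla.db is left alone.

```python
import os
import sqlite3
import tempfile


def save_rows(db_path, rows):
    """Open (or create) the database, ensure the table exists, insert rows."""
    connection = sqlite3.connect(db_path)
    try:
        # IF NOT EXISTS makes the statement safe on the second run
        connection.execute("CREATE TABLE IF NOT EXISTS revenue (Date, Revenue)")
        connection.executemany("INSERT INTO revenue VALUES (?, ?)", rows)
        connection.commit()
        return connection.execute("SELECT COUNT(*) FROM revenue").fetchone()[0]
    finally:
        # Always release the file, even if an insert fails
        connection.close()


with tempfile.TemporaryDirectory() as folder:
    db_path = os.path.join(folder, "Tesla_demo.db")  # hypothetical file name
    first_run = save_rows(db_path, [("2022-03-31", "18756")])   # creates the file
    second_run = save_rows(db_path, [("2021-12-31", "17719")])  # reuses the file

print(first_run, second_run)
```

The second call finds the row from the first call still in place. By the same logic, rerunning the full script would insert the scraped rows again, so a real script may want to clear the table first or add a uniqueness constraint.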

Source:

https://github.com/bhavyaramgiri/Web-Scraping-and-sqlite3/blob/master/week%209-%20web%20scraping%20sqlite.ipynb

https://coderspacket.com/scraping-the-web-page-and-storing-it-in-a-sqlite3-database