# Data Collection with an API

To build machine learning models, we need data. Data is available from many sources, including the internet, databases, and the natural world.
In this notebook, we access several of these data sources using an API.

## 1.1 Installing Dependencies

In [10]:
!pip install Kaggle

Collecting Kaggle
  Downloading kaggle-1.6.8.tar.gz (84 kB)

In [76]:
!pip install BeautifulSoup4



## 1.2 Importing necessary libraries

In [24]:
import requests # we use requests in order to make HTTP requests 
import pandas as pd # pandas is used for data manipulation and analysis
import numpy as np # numpy helps with large, multidimensional arrays and matrices
import kaggle # official client package for the Kaggle platform, which hosts many datasets
from kaggle.api.kaggle_api_extended import KaggleApi # client class that exposes the Kaggle API

In [27]:
pd.set_option('display.max_columns', None) # show every column when displaying a DataFrame
pd.set_option('display.max_colwidth', None) # do not truncate long cell values

api=KaggleApi() # create the Kaggle API client
api.authenticate() # authenticate with the locally stored Kaggle credentials
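
`api.authenticate()` can only succeed if credentials are available on the machine. According to the Kaggle client documentation, they are read either from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables or from a `kaggle.json` file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The helper below is a hypothetical pre-flight check and not part of the Kaggle package; it is a minimal sketch that assumes those lookup locations.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, config_dir=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper (not part of the kaggle package). It assumes the
    client looks at KAGGLE_USERNAME/KAGGLE_KEY first and then at
    <config dir>/kaggle.json.
    """
    env = os.environ if env is None else env

    # Credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Otherwise fall back to the kaggle.json token file.
    config_dir = Path(
        config_dir or env.get("KAGGLE_CONFIG_DIR") or Path.home() / ".kaggle"
    )
    return (config_dir / "kaggle.json").is_file()
```

Calling `kaggle_credentials_available()` before `api.authenticate()` gives a clearer signal about what is missing than the error raised by the client itself.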

## 1.3 Loading the data


Getting data from a Kaggle competition

In [29]:
api.competition_download_file('titanic','train.csv') # we download the train set of the competition Titanic on Kaggle

train.csv: Skipping, found more recently modified local copy (use --force to force download)
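
The message above shows that the client skips the download when a local copy already exists. The same idea can be made explicit in the notebook, so that re-running a cell never depends on that behaviour. The sketch below uses a hypothetical helper, `download_if_missing`; it assumes the `api` object exposes `competition_download_file(competition, file_name, path=...)` as used above, and that the file is saved under its own name rather than inside a zip archive.

```python
from pathlib import Path


def download_if_missing(api, competition, file_name, dest="."):
    """Download a competition file only if it is not already on disk.

    Hypothetical helper: `api` is any object with a
    competition_download_file(competition, file_name, path=...) method,
    such as the authenticated KaggleApi instance created earlier.
    """
    target = Path(dest) / file_name
    if target.exists():
        # A local copy is already present, so no request is made.
        return target

    api.competition_download_file(competition, file_name, path=str(dest))
    return target
```

In this notebook the call would be `download_if_missing(api, 'titanic', 'train.csv')`, and the returned path can be passed straight to `pd.read_csv`.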


In [33]:
df=pd.read_csv('train.csv') # load the training set into a pandas DataFrame

In [34]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [38]:
api.competition_download_file('titanic','test.csv')

test.csv: Skipping, found more recently modified local copy (use --force to force download)


In [40]:
test_data=pd.read_csv('test.csv')

In [41]:
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [42]:
df.describe() # summary statistics for the numeric columns of the training set

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
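
In the summary above, `Age` has a count of 714 while every other column has 891, which means 177 ages are missing: `describe()` counts only non-null values. A direct way to see this is `isna().sum()`. The snippet below is a small sketch on a toy DataFrame (the values are illustrative, not the real Titanic file) so that it runs without the download.

```python
import pandas as pd

# Toy frame with one missing Age, standing in for the Titanic training set.
demo = pd.DataFrame(
    {
        "Age": [22.0, 38.0, None, 35.0],
        "Fare": [7.25, 71.2833, 7.925, 53.1],
    }
)

# Number of missing values per column.
missing = demo.isna().sum()
print(missing)

# describe() reports the count of non-missing values only.
non_null_age = demo.describe().loc["count", "Age"]
print(non_null_age)
```

Applied to the real data, `df.isna().sum()` lists the missing values for every column, including the non-numeric ones that `describe()` leaves out by default.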


In [53]:
print(df[['Age']].value_counts()) # count how often each distinct value occurs in a feature
df[['Fare']].value_counts()

Age  
24.00    30
22.00    27
18.00    26
30.00    25
28.00    25
         ..
20.50     1
14.50     1
12.00     1
0.92      1
80.00     1
Name: count, Length: 88, dtype: int64


Fare   
8.0500     43
13.0000    42
7.8958     38
7.7500     34
26.0000    31
           ..
8.1125      1
8.1375      1
17.4000     1
8.1583      1
7.7292      1
Name: count, Length: 248, dtype: int64
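
By default `value_counts` reports absolute counts and ignores missing values. Two optional parameters are often useful at this stage: `normalize=True` returns proportions instead of counts, and `dropna=False` adds a row for missing values. The sketch below uses a toy Series rather than the real `Fare` column.

```python
import pandas as pd

# Toy fares with one repeated value and one missing value.
fares = pd.Series([8.05, 13.0, 8.05, 7.25, None])

counts = fares.value_counts()                     # absolute counts, NaN ignored
shares = fares.value_counts(normalize=True)       # proportions of the non-missing values
with_missing = fares.value_counts(dropna=False)   # NaN gets its own row

print(counts)
print(shares)
print(with_missing)
```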

In [55]:
df.dtypes # check the data type of each feature

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [57]:
df.shape # check the number of rows and columns in the dataset

(891, 12)

## Downloading a file from a Kaggle dataset with the Kaggle API

In [64]:
api.dataset_download_file('prathamtripathi/drug-classification',file_name='drug200.csv')

False

In [68]:
drug_data=pd.read_csv('drug200.csv')
drug_data

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,DrugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,DrugY
...,...,...,...,...,...,...
195,56,F,LOW,HIGH,11.567,drugC
196,16,M,LOW,HIGH,12.006,drugC
197,52,M,NORMAL,HIGH,9.894,drugX
198,23,M,NORMAL,NORMAL,14.020,drugX


In [71]:
drug_data.describe()

Unnamed: 0,Age,Na_to_K
count,200.0,200.0
mean,44.315,16.084485
std,16.544315,7.223956
min,15.0,6.269
25%,31.0,10.4455
50%,45.0,13.9365
75%,58.0,19.38
max,74.0,38.247


In [74]:
drug_data.dtypes

Age              int64
Sex             object
BP              object
Cholesterol     object
Na_to_K        float64
Drug            object
dtype: object

# Data Collection using Web Scraping

In [81]:
from bs4 import BeautifulSoup    # HTML parser used to navigate the downloaded page
import re
import unicodedata
import sys

### Scraping data on the US economy from Wikipedia

In [85]:
# store the URL of the page to scrape in a variable
url='https://en.wikipedia.org/wiki/Economy_of_the_United_States'

In [86]:
page=requests.get(url).text # send an HTTP GET request with requests and keep the response body as text
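
The one-line request above does not check whether the request succeeded and has no timeout. A slightly more defensive variant is sketched below. The function name `fetch_html` and the User-Agent string are assumptions made for illustration; `timeout`, `headers` and `raise_for_status()` are standard parts of the `requests` API.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Fetch a page and return its HTML, raising on HTTP errors.

    Hypothetical helper: the User-Agent value is a placeholder and
    should be replaced with something that identifies your project.
    """
    session = session or requests.Session()
    headers = {"User-Agent": "data-collection-notebook/0.1 (educational example)"}

    # timeout stops the call from hanging indefinitely on a slow server.
    response = session.get(url, headers=headers, timeout=timeout)

    # Turn 4xx/5xx responses into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, `page = fetch_html(url)` would replace the line above, and a failed request stops the notebook at the point of failure rather than later during parsing.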

In [89]:
soup=BeautifulSoup(page,'html.parser') # parse the HTML with BeautifulSoup so we can navigate the page

In [90]:
soup.title

<title>Economy of the United States - Wikipedia</title>

### Let us extract the column names from the HTML table headers

In [93]:
# first, find all the tables in the HTML page
html_tables=soup.find_all('table')
html_tables

[<table class="infobox" style="width:26.0em;padding:0;"><caption class="infobox-title adr">Economy of <span class="country-name">the United States</span></caption><tbody><tr><td class="infobox-image" colspan="2"><span typeof="mw:File"><a class="mw-file-description" href="/wiki/File:Luchtfoto_van_Lower_Manhattan.jpg"><img class="mw-file-element" data-file-height="3744" data-file-width="5616" decoding="async" height="200" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/69/Luchtfoto_van_Lower_Manhattan.jpg/300px-Luchtfoto_van_Lower_Manhattan.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/69/Luchtfoto_van_Lower_Manhattan.jpg/450px-Luchtfoto_van_Lower_Manhattan.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/69/Luchtfoto_van_Lower_Manhattan.jpg/600px-Luchtfoto_van_Lower_Manhattan.jpg 2x" width="300"/></a></span><div class="infobox-caption"><a href="/wiki/New_York_City" title="New York City">New York City</a>, the world’s principal <a href="/wiki/Fintech" titl

In [104]:
# from the list of tables, pick the ones of interest; html_tables[0] is the infobox, so we start at index 1
first_table=html_tables[1]
second_table=html_tables[2]
third_table=html_tables[3]
fourth_table=html_tables[4]
fifth_table=html_tables[5]
sixth_table=html_tables[6]

In [105]:
third_table

<table class="wikitable" style="text-align:center;">
<tbody><tr>
<th>Year
</th>
<th>GDP
<p><small>(in Bil. US$PPP)</small>
</p>
</th>
<th>GDP per capita
<p><small>(in US$ PPP)</small>
</p>
</th>
<th>GDP
<p><small>(in Bil. US$nominal)</small>
</p>
</th>
<th>GDP per capita
<p><small>(in US$ nominal)</small>
</p>
</th>
<th>GDP growth
<p><small>(real)</small>
</p>
</th>
<th>Inflation rate
<p><small>(in Percent)</small>
</p>
</th>
<th>Unemployment
<p><small>(in Percent)</small>
</p>
</th>
<th>Government debt
<p><small>(in % of GDP)</small>
</p>
</th></tr>
<tr>
<td>1980
</td>
<td>2,857.3
</td>
<td>12,552.9
</td>
<td>2,857.3
</td>
<td>12,552.9
</td>
<td><span typeof="mw:File"><span title="Decrease"><img alt="Decrease" class="mw-file-element" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Decrease2.svg/11px-Decrease2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Decrease2.svg/17px-Decreas
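
The header cells in this output are `<th>` elements whose unit labels sit in nested `<p><small>` tags. One way to turn them into clean column names is to collect the text of each `<th>` with `get_text`, as sketched below. The HTML string is a shortened copy of the table above so the example runs offline; in the notebook, `table` would be `third_table`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table structure shown above.
html = """
<table class="wikitable">
<tbody><tr>
<th>Year
</th>
<th>GDP
<p><small>(in Bil. US$PPP)</small>
</p>
</th></tr>
<tr>
<td>1980
</td>
<td>2,857.3
</td></tr>
</tbody></table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# Join the text fragments of each header cell with a single space.
headers = [th.get_text(" ", strip=True) for th in table.find_all("th")]

# Collect the data rows, skipping rows that contain only header cells.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]

print(headers)
print(rows)
```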

In [107]:
sixth_table

<table class="sidebar sidebar-collapse nomobile nowraplinks vcard hlist"><tbody><tr><td class="sidebar-pretitle">This article is part of a series on</td></tr><tr><th class="sidebar-title-with-pretitle" style="background:lavender"><a href="/wiki/Income_in_the_United_States" title="Income in the United States"><small>Income in the</small><br/>United States of America</a></th></tr><tr><td class="sidebar-image"><span typeof="mw:File"><a class="mw-file-description" href="/wiki/File:US_50_Cent_Rev.png"><img class="mw-file-element" data-file-height="2000" data-file-width="2000" decoding="async" height="110" src="//upload.wikimedia.org/wikipedia/commons/thumb/2/2d/US_50_Cent_Rev.png/110px-US_50_Cent_Rev.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/2/2d/US_50_Cent_Rev.png/165px-US_50_Cent_Rev.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/2/2d/US_50_Cent_Rev.png/220px-US_50_Cent_Rev.png 2x" width="110"/></a></span></td></tr><tr><td class="sidebar-content">
<div class="s

#### Let us consider the third table

In [113]:
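from io import StringIO # recent pandas versions expect literal HTML to be wrapped in a file-like object

df=pd.read_html(url) # fetch the page again and parse every table into a list of DataFrames (this rebinds df, which held the Titanic data)
df1=pd.read_html(StringIO(str(soup))) # parse the same tables from the HTML that was already downloaded
data=df1[3] # select the table at index 3 in that list, the yearly GDP table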
data

Unnamed: 0,Year,GDP (in Bil. US$PPP),GDP per capita (in US$ PPP),GDP (in Bil. US$nominal),GDP per capita (in US$ nominal),GDP growth (real),Inflation rate (in Percent),Unemployment (in Percent),Government debt (in % of GDP)
0,1980,2857.3,12552.9,2857.3,12552.9,-0.3%,13.5%,7.2%,
1,1981,3207.0,13948.7,3207.0,13948.7,2.5%,10.4%,7.6%,
2,1982,3343.8,14405.0,3343.8,14405.0,-1.8%,6.2%,9.7%,
3,1983,3634.0,15513.7,3634.0,15513.7,4.6%,3.2%,9.6%,
4,1984,4037.7,17086.4,4037.7,17086.4,7.2%,4.4%,7.5%,
5,1985,4339.0,18199.3,4339.0,18199.3,4.2%,3.5%,7.2%,
6,1986,4579.6,19034.8,4579.6,19034.8,3.5%,1.9%,7.0%,
7,1987,4855.3,20001.0,4855.3,20001.0,3.5%,3.6%,6.2%,
8,1988,5236.4,21376.0,5236.4,21376.0,4.2%,4.1%,5.5%,
9,1989,5641.6,22814.1,5641.6,22814.1,3.7%,4.8%,5.3%,


In [115]:
data.describe()

Unnamed: 0,Year,GDP (in Bil. US$PPP),GDP per capita (in US$ PPP),GDP (in Bil. US$nominal),GDP per capita (in US$ nominal)
count,49.0,49.0,49.0,49.0,49.0
mean,2004.0,13666.359184,44382.740816,13666.359184,44382.740816
std,14.28869,8365.841449,22616.558373,8365.841449,22616.558373
min,1980.0,2857.3,12552.9,2857.3,12552.9
25%,1992.0,6520.3,25392.9,6520.3,25392.9
50%,2004.0,12217.2,41641.6,12217.2,41641.6
75%,2016.0,18695.1,57840.0,18695.1,57840.0
max,2028.0,32349.7,93259.3,32349.7,93259.3


In [116]:
data.dtypes

Year                                 int64
GDP (in Bil. US$PPP)               float64
GDP per capita (in US$ PPP)        float64
GDP (in Bil. US$nominal)           float64
GDP per capita (in US$ nominal)    float64
GDP growth (real)                   object
Inflation rate (in Percent)         object
Unemployment (in Percent)           object
Government debt (in % of GDP)       object
dtype: object
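
The last four columns are stored as `object` because their values are strings such as `13.5%`, which is also why `describe()` leaves them out. Converting them to numbers makes them usable for analysis. The sketch below uses a hypothetical helper, `percent_to_float`, on a few values copied from the table above; it also replaces the Unicode minus sign (U+2212) that Wikipedia sometimes uses, and turns anything unparseable into `NaN`.

```python
import pandas as pd


def percent_to_float(series):
    """Convert strings like '13.5%' to floats (13.5); hypothetical helper."""
    cleaned = (
        series.str.replace("\u2212", "-", regex=False)  # Unicode minus -> ASCII hyphen
        .str.rstrip("%")
        .str.strip()
    )
    # errors="coerce" turns missing or malformed entries into NaN.
    return pd.to_numeric(cleaned, errors="coerce")


demo = pd.DataFrame(
    {
        "Year": [1980, 1981, 1982],
        "GDP growth (real)": ["-0.3%", "2.5%", "\u22121.8%"],
        "Government debt (in % of GDP)": [None, None, None],
    }
)

growth = percent_to_float(demo["GDP growth (real)"])
debt = percent_to_float(demo["Government debt (in % of GDP)"].astype("object"))
print(growth.tolist())
```

In the notebook, the same function could be applied to each percent column of `data` in a loop.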

### Let us see the remaining tables on the page

In [121]:
data2=df1[4]
data2

Unnamed: 0,No.,Country/Economy,Real GDP,Agri.,Indus.,Serv.
0,–,World,60093221,1968215,16453140,38396695
1,1,United States,15160104,149023,3042332,11518980


In [122]:
data3=df1[5]
data3

Unnamed: 0,No.,Country/Economy,Nominal GDP,Agri.,Indus.,Serv.
0,1,United States,18624450,204868.95,3613143.3,14806437.75
1,*Percentages from CIA World Factbook[137],*Percentages from CIA World Factbook[137],*Percentages from CIA World Factbook[137],*Percentages from CIA World Factbook[137],*Percentages from CIA World Factbook[137],*Percentages from CIA World Factbook[137]


In [126]:
data4=df[9]
data4

Unnamed: 0_level_0,Balance of trade 2014 (goods only)[361],Balance of trade 2014 (goods only)[361],Balance of trade 2014 (goods only)[361],Balance of trade 2014 (goods only)[361],Balance of trade 2014 (goods only)[361],Balance of trade 2014 (goods only)[361],Balance of trade 2014 (goods only)[361],Balance of trade 2014 (goods only)[361],Balance of trade 2014 (goods only)[361],Balance of trade 2014 (goods only)[361]
Unnamed: 0_level_1,Unnamed: 0_level_1,China,Euro area,Japan,Mexico,Pacific,Canada,Middle East,Latin America,Total by product
0,Computer,−151.9,3.4,−8.0,−11.0,−26.1,20.9,5.8,12.1,−155.0
1,"Oil, gas, minerals",1.9,6.4,2.4,−20.8,1.1,−79.8,−45.1,−15.9,−149.7
2,Transportation,10.9,−30.9,−46.2,−59.5,−0.5,−6.1,17.1,8.8,−106.3
3,Apparel,−56.3,−4.9,0.6,−4.2,−6.3,2.5,−0.3,−1.1,−69.9
4,Electrical equipment,−35.9,−2.4,−4.0,−8.5,−3.3,10.0,1.8,2.0,−40.4
5,Misc. manufacturing,−35.3,4.9,2.7,−2.8,−1.4,5.8,−1.5,1.8,−25.8
6,Furniture,−18.3,−1.2,0.0,−1.6,−2.1,0.4,0.2,0.0,−22.6
7,Machinery,−19.9,−27.0,−18.8,3.9,7.6,18.1,4.5,9.1,−22.4
8,Primary metals,−3.1,3.1,−1.8,1.0,1.9,−8.9,−0.9,−10.4,−19.1
9,Fabricated metals,−17.9,−5.9,−3.5,2.8,−4.3,7.3,1.2,1.9,−18.5
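
This last table comes back with a two-level column header (the table title on top, the trading partners below) and with numbers stored as strings that use the Unicode minus sign. A possible cleanup is sketched below on a small excerpt of the values above. The index name `Product` is an assumption chosen for illustration, since the original column is unnamed.

```python
import pandas as pd

# Small excerpt of the table above, rebuilt with the same two-level header.
columns = pd.MultiIndex.from_tuples(
    [
        ("Balance of trade 2014 (goods only)", "Unnamed: 0_level_1"),
        ("Balance of trade 2014 (goods only)", "China"),
        ("Balance of trade 2014 (goods only)", "Canada"),
    ]
)
demo = pd.DataFrame(
    [["Computer", "\u2212151.9", "20.9"], ["Apparel", "\u221256.3", "2.5"]],
    columns=columns,
)

# Keep only the lower header level, which holds the useful column names.
demo.columns = demo.columns.get_level_values(1)

# "Product" is a hypothetical name for the unnamed first column.
demo = demo.rename(columns={"Unnamed: 0_level_1": "Product"}).set_index("Product")

# Replace the Unicode minus with an ASCII hyphen, then convert to floats.
trade = demo.apply(
    lambda col: pd.to_numeric(col.str.replace("\u2212", "-", regex=False))
)
print(trade)
```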
