DAP – Data Analysis Process
-	Data Gathering 
-	Data Cleaning
-	EDA

Data Gathering
1)	Import
-	Database
-	CSV
-	Excel
-	Text
-	Json

2)	Export
-	csv
-	json
-	Excel
-	Database
-	Html

3)	API
4)	Web Scraping


### What is Data Analysis
Data Analysis is a process of inspecting, cleansing, transforming 
and modeling data with the goal of discovering useful information, 
informing conclusions and supporting decision-making

Part 1 - Asking Wright Questions 
- Subject Matter Expertise
- Experience

Part 2 - Data Wrangling/Munging

is the process of transforming and mapping data from one raw data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purpose such as analytics
1. Gathering Data
2. Assessing Data
- Finding the number of rows/columns (shape)
- Data Types of various columns (info())
- Checking for missing values (info())
- Check for duplicate data (is_unique)
- Memory occupied by the dataset (info)
- High level mathematical overview of the data (describe)
3. Cleaning Data
- Missing Data (e.g mean)
- Remove Duplicate data (drop_duplicates)
- Incorrect data type (astype)

Part 3 - Exploratory Data Analysis 
1. Exploring Data
- Finding Correlation and covariance
- Doing univariate and multivariate analysis
- Plotting graphs (Data Visualization)
2. Augmenting Data

These operations are collectively called as Feature Engineering
- Removing Outliers (using box plot)
- Merging DataFrames
- Adding Column

Part 4 - Drawing Conclusions
- ML
- Inferential Statistics
- Descriptive Statistics

Part 5 - Communicating Results
- In person
- Reports
- Blog Post
- PPTs/Slide decks

# Working with CSV

In [1]:
# Importing pandas
import pandas as pd

In [2]:
# Opening a local csv file
df = pd.read_csv('day_15/aug_train.csv')
print(df.shape)
df.head()

(19158, 14)


Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [3]:
# Opening a csv file from a URL

import requests
from io import StringIO

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)

pd.read_csv(data)

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA
...,...,...
189,Paraguay,SOUTH AMERICA
190,Peru,SOUTH AMERICA
191,Suriname,SOUTH AMERICA
192,Uruguay,SOUTH AMERICA


In [4]:
# sep parameter (separator)

pd.read_csv('day_15/movie_titles_metadata.tsv', sep='\t',
            names=['S.No.', 'movie_Name', 'release_year', 'rating', 'votes', 'genres'])

Unnamed: 0,S.No.,movie_Name,release_year,rating,votes,genres
0,m0,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
1,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
2,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
3,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
4,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']
...,...,...,...,...,...,...
612,m612,watchmen,2009,7.8,135229.0,['action' 'crime' 'fantasy' 'mystery' 'sci-fi'...
613,m613,xxx,2002,5.6,53505.0,['action' 'adventure' 'crime']
614,m614,x-men,2000,7.4,122149.0,['action' 'sci-fi']
615,m615,young frankenstein,1974,8.0,57618.0,['comedy' 'sci-fi']


In [5]:
# index_col parameter
pd.read_csv('day_15/aug_train.csv', index_col='enrollee_id')

Unnamed: 0_level_0,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
enrollee_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


In [6]:
# header parameter
pd.read_csv('day_15/test.csv', header=1)

Unnamed: 0,0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0
1,2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0
2,3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1
3,4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0


In [7]:
# use_cols parameter
pd.read_csv('day_15/aug_train.csv', usecols=['enrollee_id', 'gender', 'education_level'])

Unnamed: 0,enrollee_id,gender,education_level
0,8949,Male,Graduate
1,29725,Male,Graduate
2,11561,,Graduate
3,33241,,Graduate
4,666,Male,Masters
...,...,...,...
19153,7386,Male,Graduate
19154,31398,Male,Graduate
19155,24576,Male,Graduate
19156,5756,Male,High School


In [8]:
pd.read_csv('day_15/aug_train.csv').columns

Index(['enrollee_id', 'city', 'city_development_index', 'gender',
       'relevent_experience', 'enrolled_university', 'education_level',
       'major_discipline', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'target'],
      dtype='object')

In [9]:
pd.read_csv('day_15/aug_train.csv').head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [10]:
# skiprows/nrows parameter
# don't enter 0, because it is the header of the DataFrame
pd.read_csv('day_15/aug_train.csv', skiprows=[1, 3])

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
1,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
2,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
3,21651,city_176,0.764,,Has relevent experience,Part time course,Graduate,STEM,11,,,1,24,1.0
4,28806,city_160,0.920,Male,Has relevent experience,no_enrollment,High School,,5,50-99,Funded Startup,1,24,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19151,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19152,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19153,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19154,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


In [11]:
pd.read_csv('day_15/aug_train.csv', skiprows=lambda x: x in [1, 3])

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
1,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
2,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
3,21651,city_176,0.764,,Has relevent experience,Part time course,Graduate,STEM,11,,,1,24,1.0
4,28806,city_160,0.920,Male,Has relevent experience,no_enrollment,High School,,5,50-99,Funded Startup,1,24,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19151,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19152,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19153,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19154,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


In [12]:
pd.read_csv('day_15/aug_train.csv', nrows=100)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,12081,city_65,0.802,Male,Has relevent experience,Full time course,Graduate,STEM,9,50-99,Pvt Ltd,1,33,0.0
96,7364,city_160,0.920,,No relevent experience,Full time course,High School,,2,100-500,Pvt Ltd,1,142,0.0
97,11184,city_74,0.579,,No relevent experience,Full time course,Graduate,STEM,2,100-500,Pvt Ltd,1,34,0.0
98,7016,city_65,0.802,Male,Has relevent experience,no_enrollment,Graduate,STEM,6,50-99,Pvt Ltd,2,14,1.0


In [13]:
# encoding parameter
pd.read_csv('day_15/zomato.csv')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte

In [None]:
# inorder to solve the encoding problem use sublime text or other editor to change it
# or else find out the encoding
pd.read_csv('day_15/zomato.csv', encoding='latin-1')

In [14]:
# skip bad lines
# used to solve parser errors
pd.read_csv('day_15/aug_train.csv', on_bad_lines='skip')

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


In [15]:
# dtypes parameter
# target is in float64 format, which is unnecessary, changing it to int now 
pd.read_csv('day_15/aug_train.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)
me

In [16]:
df = pd.read_csv('day_15/aug_train.csv', dtype={'target': int})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  int32  
dtypes: float64(1), int32(1), int64(2), obj

In [17]:
# handling dates
# date is in Object format
pd.read_csv('day_15/IPL Matches 2008-2020.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               816 non-null    int64  
 1   city             803 non-null    object 
 2   date             816 non-null    object 
 3   player_of_match  812 non-null    object 
 4   venue            816 non-null    object 
 5   neutral_venue    816 non-null    int64  
 6   team1            816 non-null    object 
 7   team2            816 non-null    object 
 8   toss_winner      816 non-null    object 
 9   toss_decision    816 non-null    object 
 10  winner           812 non-null    object 
 11  result           812 non-null    object 
 12  result_margin    799 non-null    float64
 13  eliminator       812 non-null    object 
 14  method           19 non-null     object 
 15  umpire1          816 non-null    object 
 16  umpire2          816 non-null    object 
dtypes: float64(1), i

In [18]:
pd.read_csv('day_15/IPL Matches 2008-2020.csv', parse_dates=['date']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   id               816 non-null    int64         
 1   city             803 non-null    object        
 2   date             816 non-null    datetime64[ns]
 3   player_of_match  812 non-null    object        
 4   venue            816 non-null    object        
 5   neutral_venue    816 non-null    int64         
 6   team1            816 non-null    object        
 7   team2            816 non-null    object        
 8   toss_winner      816 non-null    object        
 9   toss_decision    816 non-null    object        
 10  winner           812 non-null    object        
 11  result           812 non-null    object        
 12  result_margin    799 non-null    float64       
 13  eliminator       812 non-null    object        
 14  method           19 non-null     object   

In [19]:
data = {
    'day': [1, 15, 23, 5, 19],
    'month': [1, 3, 5, 8, 12],
    'year': [2022, 2023, 2023, 2024, 2025]
}
df = pd.DataFrame(data)
df.head()

Unnamed: 0,day,month,year
0,1,1,2022
1,15,3,2023
2,23,5,2023
3,5,8,2024
4,19,12,2025


In [20]:
df.to_csv('dates_df.csv')

In [21]:
df = pd.read_csv('dates_df.csv', index_col=0)
df.head()

Unnamed: 0,day,month,year
0,1,1,2022
1,15,3,2023
2,23,5,2023
3,5,8,2024
4,19,12,2025


In [22]:
df['date'] = pd.to_datetime(df[['day', 'month', 'year']])
df.head()

Unnamed: 0,day,month,year,date
0,1,1,2022,2022-01-01
1,15,3,2023,2023-03-15
2,23,5,2023,2023-05-23
3,5,8,2024,2024-08-05
4,19,12,2025,2025-12-19


In [23]:
# convertors
def rename(name):
    if name == 'Mumbai Indians':
        return 'MI'
    else:
        return name

In [24]:
rename('Mumbai Indians')

'MI'

In [25]:
df = pd.read_csv('day_15/IPL Matches 2008-2020.csv', converters={'team1': rename})
df.head()

Unnamed: 0,id,city,date,player_of_match,venue,neutral_venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,eliminator,method,umpire1,umpire2
0,335982,Bangalore,2008-04-18,BB McCullum,M Chinnaswamy Stadium,0,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,N,,Asad Rauf,RE Koertzen
1,335983,Chandigarh,2008-04-19,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",0,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,N,,MR Benson,SL Shastri
2,335984,Delhi,2008-04-19,MF Maharoof,Feroz Shah Kotla,0,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,N,,Aleem Dar,GA Pratapkumar
3,335985,Mumbai,2008-04-20,MV Boucher,Wankhede Stadium,0,MI,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,N,,SJ Davis,DJ Harper
4,335986,Kolkata,2008-04-20,DJ Hussey,Eden Gardens,0,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,N,,BF Bowden,K Hariharan


In [26]:
# na_values parameter
# what values do you want to consider as NaN values
# example  hyphen - or 
# here for example we converted Male to NaN values
pd.read_csv('day_15/aug_train.csv', na_values='Male')

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


In [27]:
# Loading a huge dataset in chunks
dfs = pd.read_csv('day_15/aug_train.csv', chunksize=5000)

for chunk in dfs:
    print(chunk.shape)

(5000, 14)
(5000, 14)
(5000, 14)
(4158, 14)


# Working with Excel

In [28]:
pd.read_excel('zomato-schema.xlsx')

Unnamed: 0,user_id,name,email,password
0,1,Nitish,nitish@gmail.com,p252h
1,2,Khushboo,khushboo@gmail.com,hxn9b
2,3,Vartika,vartika@gmail.com,9hu7j
3,4,Ankit,ankit@gmail.com,lkko3
4,5,Neha,neha@gmail.com,3i7qm
5,6,Anupama,anupama@gmail.com,46rdw2
6,7,Rishabh,rishabh@gmail.com,4sw123


In [29]:
pd.read_excel('zomato-schema.xlsx', sheet_name='restaurants')

Unnamed: 0,r_id,r_name,cuisine
0,1,dominos,Italian
1,2,kfc,American
2,3,box8,North Indian
3,4,Dosa Plaza,South Indian
4,5,China Town,Chinese


# Reading text files

In [30]:
# assume this was tab seperated columns in a text file and we imported it

# throws an error
# pd.read_csv('movie_titles.txt', sep='\t')

# Working with JSON
JSON - JavaScript On Notation

In [31]:
json_df = pd.read_json('day_16/recipe_train.json')
print(json_df.shape)
print(json_df.head())

(39774, 3)
      id      cuisine                                        ingredients
0  10259        greek  [romaine lettuce, black olives, grape tomatoes...
1  25693  southern_us  [plain flour, ground pepper, salt, tomatoes, g...
2  20130     filipino  [eggs, pepper, salt, mayonaise, cooking oil, g...
3  22213       indian                [water, vegetable oil, wheat, salt]
4  13162       indian  [black pepper, shallots, cornflour, cayenne pe...


In [32]:
pd.read_json('https://api.exchangerate-api.com/v4/latest/INR')

Unnamed: 0,provider,WARNING_UPGRADE_TO_V6,terms,base,date,time_last_updated,rates
INR,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2024-11-05,1730764801,1.0000
AED,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2024-11-05,1730764801,0.0436
AFN,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2024-11-05,1730764801,0.7950
ALL,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2024-11-05,1730764801,1.0800
AMD,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2024-11-05,1730764801,4.6000
...,...,...,...,...,...,...,...
XPF,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2024-11-05,1730764801,1.3000
YER,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2024-11-05,1730764801,2.9700
ZAR,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2024-11-05,1730764801,0.2080
ZMW,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2024-11-05,1730764801,0.3200


# Working with SQL type

In [1]:
import mysql.connector
import pandas as pd

conn = mysql.connector.connect(host='localhost', user='root', password='', database='world')
conn

InterfaceError: 2003: Can't connect to MySQL server on 'localhost:3306' (10061 No connection could be made because the target machine actively refused it)

In [2]:
from sqlalchemy import create_engine

engine = create_engine('mysql://username:password@hostname:port/database')

ValueError: invalid literal for int() with base 10: 'port'

In [3]:
pd.read_sql_query("SELECT * FROM city", conn)

NameError: name 'conn' is not defined

In [None]:
pd.read_sql_query("SELECT * FROM city WHERE CountryCode LIKE 'USA'", conn)

In [35]:
pd.read_sql_query("SELECT * FROM country WHERE LifeExpectancy > 60", conn)

  pd.read_sql_query("SELECT * FROM country WHERE LifeExpectancy > 60", conn)


Unnamed: 0,Code,Name,Continent,Region,SurfaceArea,IndepYear,Population,LifeExpectancy,GNP,GNPOld,LocalName,GovernmentForm,HeadOfState,Capital,Code2
0,ABW,Aruba,North America,Caribbean,193.0,,103000,78.4,828.0,793.0,Aruba,Nonmetropolitan Territory of The Netherlands,Beatrix,129,AW
1,AIA,Anguilla,North America,Caribbean,96.0,,8000,76.1,63.2,,Anguilla,Dependent Territory of the UK,Elisabeth II,62,AI
2,ALB,Albania,Europe,Southern Europe,28748.0,1912.0,3401200,71.6,3205.0,2500.0,ShqipÃ«ria,Republic,Rexhep Mejdani,34,AL
3,AND,Andorra,Europe,Southern Europe,468.0,1278.0,78000,83.5,1630.0,,Andorra,Parliamentary Coprincipality,,55,AD
4,ANT,Netherlands Antilles,North America,Caribbean,800.0,,217000,74.7,1941.0,,Nederlandse Antillen,Nonmetropolitan Territory of The Netherlands,Beatrix,33,AN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162,VIR,"Virgin Islands, U.S.",North America,Caribbean,347.0,,93000,78.1,0.0,,Virgin Islands of the United States,US Territory,George W. Bush,4067,VI
163,VNM,Vietnam,Asia,Southeast Asia,331689.0,1945.0,79832000,69.3,21929.0,22834.0,ViÃªt Nam,Socialistic Republic,TrÃ¢n Duc Luong,3770,VN
164,VUT,Vanuatu,Oceania,Melanesia,12189.0,1980.0,190000,60.6,261.0,246.0,Vanuatu,Republic,John Bani,3537,VU
165,WSM,Samoa,Oceania,Polynesia,2831.0,1962.0,180000,69.2,141.0,157.0,Samoa,Parlementary Monarchy,Malietoa Tanumafili II,3169,WS


In [36]:
pd.read_sql_query("SELECT * FROM countryLanguage", conn)

  pd.read_sql_query("SELECT * FROM countryLanguage", conn)


Unnamed: 0,CountryCode,Language,IsOfficial,Percentage
0,ABW,Dutch,T,5.3
1,ABW,English,F,9.5
2,ABW,Papiamento,F,76.7
3,ABW,Spanish,F,7.4
4,AFG,Balochi,F,0.9
...,...,...,...,...
979,ZMB,Tongan,F,11.0
980,ZWE,English,T,2.2
981,ZWE,Ndebele,F,16.2
982,ZWE,Nyanja,F,2.2


# Pandas Export
- to csv
- to excel
- to html
- to json
- to sql

In [37]:
df = pd.DataFrame()
# df.to_csv()

In [38]:
# to csv
df = pd.read_csv('deliveries.csv')
df.head()

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,...,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
0,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,1,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
1,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,2,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
2,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,3,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,4,0,4,,,
3,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,4,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
4,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,5,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,2,2,,,


In [39]:
temp_df = df.groupby("batsman")['batsman_runs'].sum().reset_index().sort_values(by='batsman_runs', ascending=False)
temp_df.head()

Unnamed: 0,batsman,batsman_runs
486,V Kohli,5434
428,SK Raina,5415
367,RG Sharma,4914
112,DA Warner,4741
392,S Dhawan,4632


In [40]:
temp_df.to_csv('total_batsman_runs.csv', index=False)

In [41]:
df.pivot_table(index='batsman', columns='bowling_team', values='batsman_runs', aggfunc='sum').to_csv(
    'batsman_vs_teams.csv', index=True)

In [42]:
# to excel
temp_df.to_excel('total_batsman_runs.xlsx')

In [43]:
temp_df.to_excel('total_batsman_runs_sheet.xlsx', sheet_name='batman_runs', index=False)

In [44]:
temp_df_2 = df.pivot_table(index='batsman', columns='bowling_team', values='batsman_runs', aggfunc='sum')

# multiple sheets

with pd.ExcelWriter('total_batsman_runs_multiple_sheep.xlsx') as writer:
    temp_df.to_excel(writer, sheet_name='total_batsman_runs', index=False)
    temp_df_2.to_excel(writer, sheet_name='batsman_vs_teams')

In [45]:
# to html
df.query('batsman_runs == 6').pivot_table(index='over', columns='ball', values='batsman_runs', aggfunc='count').to_html(
    'sixes_heatmap.html')

In [46]:
# to json
df.groupby(['batting_team', 'batsman']).batsman_runs.sum().unstack().to_json('ipl.json')

In [47]:
# to sql
df

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,...,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
0,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,1,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
1,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,2,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
2,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,3,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,4,0,4,,,
3,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,4,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
4,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,5,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,2,2,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179073,11415,2,Chennai Super Kings,Mumbai Indians,20,2,RA Jadeja,SR Watson,SL Malinga,0,...,0,0,0,0,1,0,1,,,
179074,11415,2,Chennai Super Kings,Mumbai Indians,20,3,SR Watson,RA Jadeja,SL Malinga,0,...,0,0,0,0,2,0,2,,,
179075,11415,2,Chennai Super Kings,Mumbai Indians,20,4,SR Watson,RA Jadeja,SL Malinga,0,...,0,0,0,0,1,0,1,SR Watson,run out,KH Pandya
179076,11415,2,Chennai Super Kings,Mumbai Indians,20,5,SN Thakur,RA Jadeja,SL Malinga,0,...,0,0,0,0,2,0,2,,,


In [48]:
import pymysql
from sqlalchemy import create_engine

In [49]:

# engine = create_engine('mysql://username:password@hostname:port/database')
engine = create_engine("mysql+pymysql://root:@localhost/ipl")
# {root}:{password}@{url}/{database}
df.to_sql('ipl_delivery', con=engine, if_exists='append')

179078

In [50]:
temp_df.head()

Unnamed: 0,batsman,batsman_runs
486,V Kohli,5434
428,SK Raina,5415
367,RG Sharma,4914
112,DA Warner,4741
392,S Dhawan,4632


In [51]:
temp_df.to_sql('batsman_runs', con=engine, if_exists='append')

516

In [52]:
six_df = df.query('batsman_runs == 6').pivot_table(index='over', columns='ball', values='batsman_runs', aggfunc='count')

six_df.to_sql('six_heatmap', con=engine, if_exists='append')

20

In [53]:
import requests

# url = "https://api.themoviedb.org/3/discover/movie?include_adult=false&include_video=false&language=en-US&page=1&sort_by=popularity.desc"
# 
# headers = {
#     "accept": "application/json",
#     "Authorization": "Bearer eyJhbGciOiJIUzI1NiJ9.eyJhdWQiOiIzMzZiNDUzYTM5ZDY1NDE3MWNmMzEyYmEwODk2NjBlMyIsIm5iZiI6MTczMDc3MDA5Ni4zNzA3NjY5LCJzdWIiOiI2NzI5NzJlMzQyYmVjNDk4Nzc4MDMyMjUiLCJzY29wZXMiOlsiYXBpX3JlYWQiXSwidmVyc2lvbiI6MX0.MiF7sf232ssQGB_OW0f5127d7B2bebiXEi8AbK-_g4o"
# }
# 
# response = requests.get(url, headers=headers)
# 
# print(response.text)

id

title

release_date

overview

popularity

vote_average

vote_count

total_pages : 46940 x 7 = 328580

total_results : 938793 x 7 = 6571551



In [54]:
api_url = "https://api.themoviedb.org/3/discover/movie?include_adult=false&include_video=false&language=en-US&page=1&sort_by=popularity.desc"

headers = {
    "accept": "application/json",
    "Authorization": "Bearer eyJhbGciOiJIUzI1NiJ9.eyJhdWQiOiIzMzZiNDUzYTM5ZDY1NDE3MWNmMzEyYmEwODk2NjBlMyIsIm5iZiI6MTczMDc3MDA5Ni4zNzA3NjY5LCJzdWIiOiI2NzI5NzJlMzQyYmVjNDk4Nzc4MDMyMjUiLCJzY29wZXMiOlsiYXBpX3JlYWQiXSwidmVyc2lvbiI6MX0.MiF7sf232ssQGB_OW0f5127d7B2bebiXEi8AbK-_g4o"
}

response = requests.get(api_url, headers=headers)
response.json()

{'page': 1,
 'results': [{'adult': False,
   'backdrop_path': '/gMQibswELoKmB60imE7WFMlCuqY.jpg',
   'genre_ids': [27, 53],
   'id': 1034541,
   'original_language': 'en',
   'original_title': 'Terrifier 3',
   'overview': "Five years after surviving Art the Clown's Halloween massacre, Sienna and Jonathan are still struggling to rebuild their shattered lives. As the holiday season approaches, they try to embrace the Christmas spirit and leave the horrors of the past behind. But just when they think they're safe, Art returns, determined to turn their holiday cheer into a new nightmare. The festive season quickly unravels as Art unleashes his twisted brand of terror, proving that no holiday is safe.",
   'popularity': 7119.469,
   'poster_path': '/63xYQj1BwRFielxsBDXvHIJyXVm.jpg',
   'release_date': '2024-10-09',
   'title': 'Terrifier 3',
   'video': False,
   'vote_average': 7.283,
   'vote_count': 572},
  {'adult': False,
   'backdrop_path': '/3V4kLQg0kSqPLctI5ziYWabAZYF.jpg',
   'gen

In [55]:
df = pd.DataFrame(response.json()['results'])
print(df.shape)
df.head()

(20, 14)


Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/gMQibswELoKmB60imE7WFMlCuqY.jpg,"[27, 53]",1034541,en,Terrifier 3,Five years after surviving Art the Clown's Hal...,7119.469,/63xYQj1BwRFielxsBDXvHIJyXVm.jpg,2024-10-09,Terrifier 3,False,7.283,572
1,False,/3V4kLQg0kSqPLctI5ziYWabAZYF.jpg,"[878, 28, 12]",912649,en,Venom: The Last Dance,Eddie and Venom are on the run. Hunted by both...,6160.532,/6f9mRtz5QHvYpvDRxnsrfSVW4Pu.jpg,2024-10-22,Venom: The Last Dance,False,6.713,458
2,False,/4zlOPT9CrtIX05bBIkYxNZsm5zN.jpg,"[16, 878, 10751]",1184918,en,The Wild Robot,"After a shipwreck, an intelligent robot called...",5275.345,/wTnV3PCVW5O92JMrFvvrRcV39RU.jpg,2024-09-12,The Wild Robot,False,8.542,2343
3,False,/9oYdz5gDoIl8h67e3ccv3OHtmm2.jpg,"[18, 27, 878]",933260,en,The Substance,Have you ever dreamt of a better version of yo...,3439.676,/lqoMzCcZYEFK729d6qzt349fB4o.jpg,2024-09-07,The Substance,False,7.322,1343
4,False,/oPUOpnl3pqD8wuidjfUn17mO1yA.jpg,"[16, 878, 12, 10751]",698687,en,Transformers One,The untold origin story of Optimus Prime and M...,3135.612,/iHPIBzrjJHbXeY9y7VVbEVNt7LW.jpg,2024-09-11,Transformers One,False,8.141,516


In [56]:
temp_df = df[['id', 'title', 'overview', 'release_date', 'popularity', 'vote_average', 'vote_count']]
temp_df.head()

Unnamed: 0,id,title,overview,release_date,popularity,vote_average,vote_count
0,1034541,Terrifier 3,Five years after surviving Art the Clown's Hal...,2024-10-09,7119.469,7.283,572
1,912649,Venom: The Last Dance,Eddie and Venom are on the run. Hunted by both...,2024-10-22,6160.532,6.713,458
2,1184918,The Wild Robot,"After a shipwreck, an intelligent robot called...",2024-09-12,5275.345,8.542,2343
3,933260,The Substance,Have you ever dreamt of a better version of yo...,2024-09-07,3439.676,7.322,1343
4,698687,Transformers One,The untold origin story of Optimus Prime and M...,2024-09-11,3135.612,8.141,516


We will do this entire process in a loop where loop will run for number of pages, i.e. 46940 times, and we will create a df and append this to the main df. 

Finally we will have a big df with 938793 movies

In [57]:
df = pd.DataFrame()
df

In [58]:
counter = 1

for i in range(1, 500):
    response = requests.get(
        f"https://api.themoviedb.org/3/discover/movie?include_adult=false&include_video=false&language=en-US&page={i}&sort_by=popularity.desc",
        headers=headers)
    temp_df = pd.DataFrame(response.json()['results'])
    temp_df = temp_df[
        ['id', 'title', 'overview', 'release_date', 'popularity', 'vote_average', 'vote_count']]
    df = pd.concat([df, temp_df], ignore_index=True)

In [59]:
df.shape

(9980, 7)

In [60]:
df.to_csv('movies.csv')