<a href="https://colab.research.google.com/github/DLPY/Regression-Session-1/blob/master/EDA_using_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis using Python

## 1. Packages in Python

### Background

A Python package is a collection of modules that provides additional functions and features to a program. When a program needs a module from an external package, it imports the package and then uses its modules.

Users can create their own Python packages, but many existing packages already provide widely useful functionality.
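As a minimal sketch of these import mechanics, the example below shows the three common import forms. The `math` and `statistics` modules are standard-library modules picked here purely for illustration; they are not otherwise used in this notebook.

```python
# Three common ways to import; `math` and `statistics` are standard-library
# modules chosen only to illustrate the syntax.
import math                      # import the whole module
import statistics as stats       # import under a shorter alias
from math import sqrt            # import a single name from a module

values = [4, 9, 16]

print(math.floor(2.7))           # access through the module name
print(stats.mean(values))        # access through the alias
print(sqrt(values[2]))           # use the imported name directly
```

The alias form is the one used for pandas and NumPy in the next cell (`pd`, `np`).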

Some popular Python packages and their uses:
1. Pandas - Provides efficient data structures for structured and time-series data
2. NumPy - Provides N-dimensional arrays and other useful numerical tools
3. Matplotlib - Helps developers create visualizations
4. SciPy - Provides tools and libraries for mathematical, engineering, and scientific calculations
5. Scikit-Learn - Provides various functions to implement machine learning and data mining tasks
6. Beautiful Soup - Scrapes all or specified data from web pages
7. TensorFlow - Provides the necessary tools for machine learning projects

In [3]:
# Import the packages used in this notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## 2. Loading your data

The next step is to load the data into Python. pandas provides a reader function for each common file format:
1. CSV file - pd.read_csv()
2. MS Excel - pd.read_excel()
3. JSON - pd.read_json()
4. HTML tables - pd.read_html()
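As a rough illustration of how these readers behave, the sketch below parses the same two records from CSV text and from JSON text. The records and variable names are made up for this example, and `io.StringIO` stands in for a real file so the snippet runs without any downloads.

```python
import io
import pandas as pd

# Hypothetical records, written once as CSV text and once as JSON text
csv_text = "title,release_year\nExample A,2019\nExample B,2021\n"
json_text = (
    '[{"title": "Example A", "release_year": 2019},'
    ' {"title": "Example B", "release_year": 2021}]'
)

# Both readers accept a path, a URL, or a file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df.shape)
print(json_df.shape)
print(list(json_df.columns))
```

Each reader returns a DataFrame, so the exploration steps that follow work the same way regardless of the original file format.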

In [1]:
# Download the CSV from its raw GitHub URL. Alternatively, upload the file to your session storage by clicking the file icon in the left toolbar.
! wget https://raw.githubusercontent.com/DLPY/Regression-Session-1/master/Data/netflix_titles.csv

--2021-11-16 07:52:50--  https://raw.githubusercontent.com/DLPY/Regression-Session-1/master/Data/netflix_titles.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3399671 (3.2M) [text/plain]
Saving to: ‘netflix_titles.csv’


2021-11-16 07:52:51 (44.8 MB/s) - ‘netflix_titles.csv’ saved [3399671/3399671]



In [4]:
# Read the input file into a DataFrame
netflix_df = pd.read_csv('netflix_titles.csv',sep=",")

## 3. Basic Data Exploration

In this step we will perform some basic operations to check what the dataset contains. The following functions and attributes will be used:

1. head()/tail() - The head/tail functions return the top/bottom records in the dataset. By default, pandas returns only the first 5 or last 5 records; pass a number to change this.

2. shape - The shape attribute gives the number of records and features in the dataset. It is used to check the dimensions of the data.

3. info() - The info function reports the non-null count and the datatype of each column.

4. describe() - The describe method summarizes how the numerical values are spread, e.g. the minimum, the mean, several percentiles, and the maximum.

In [5]:
# The top 5 records
netflix_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [8]:
# The bottom 10 records
netflix_df.tail(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8797,s8798,TV Show,Zak Storm,,"Michael Johnston, Jessica Gee-George, Christin...","United States, France, South Korea, Indonesia","September 13, 2018",2016,TV-Y7,3 Seasons,Kids' TV,Teen surfer Zak Storm is mysteriously transpor...
8798,s8799,Movie,Zed Plus,Chandra Prakash Dwivedi,"Adil Hussain, Mona Singh, K.K. Raina, Sanjay M...",India,"December 31, 2019",2014,TV-MA,131 min,"Comedies, Dramas, International Movies",A philandering small-town mechanic's political...
8799,s8800,Movie,Zenda,Avadhoot Gupte,"Santosh Juvekar, Siddharth Chandekar, Sachit P...",India,"February 15, 2018",2009,TV-14,120 min,"Dramas, International Movies",A change in the leadership of a political part...
8800,s8801,TV Show,Zindagi Gulzar Hai,,"Sanam Saeed, Fawad Khan, Ayesha Omer, Mehreen ...",Pakistan,"December 15, 2016",2012,TV-PG,1 Season,"International TV Shows, Romantic TV Shows, TV ...","Strong-willed, middle-class Kashaf and carefre..."
8801,s8802,Movie,Zinzana,Majid Al Ansari,"Ali Suliman, Saleh Bakri, Yasa, Ali Al-Jabri, ...","United Arab Emirates, Jordan","March 9, 2016",2015,TV-MA,96 min,"Dramas, International Movies, Thrillers",Recovering alcoholic Talal wakes up inside a s...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...


## Dataset

netflix_titles.csv: The CSV file contains information about the movies and TV shows available on Netflix.

1. Show ID - unique ID of that particular title
2. Type - type of the video: Movie or TV Show
3. Title - title of the video
4. Director - director name
5. Cast - cast members
6. Country - country where the content was produced
7. Date Added - date when the title became live on Netflix
8. Release Year - year of release
9. Rating - content maturity rating (e.g. PG-13, TV-MA)
10. Duration - duration of the title, in minutes for movies and in seasons for TV shows
11. Listed in - genre information
12. Description - concise plot summary of the title
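Several of these columns hold structured information as plain text, for example dates such as "September 25, 2021" and durations such as "90 min" or "2 Seasons". The following is a minimal sketch of how such text could be converted into typed columns. The sample DataFrame and the new column names (`date_added_parsed`, `duration_value`, `duration_unit`) are hypothetical, and the real file may contain formats this sketch does not handle.

```python
import pandas as pd

# Hypothetical sample mimicking the date_added and duration columns
sample = pd.DataFrame({
    "date_added": ["September 25, 2021", "March 9, 2016", None],
    "duration": ["90 min", "2 Seasons", "1 Season"],
})

# Parse "Month day, year" text; missing or unparseable values become NaT
sample["date_added_parsed"] = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Split duration into its numeric part and its unit ("min", "Season", "Seasons")
parts = sample["duration"].str.extract(r"(\d+)\s*(\D+)")
sample["duration_value"] = pd.to_numeric(parts[0])
sample["duration_unit"] = parts[1].str.strip()

print(sample)
```

Keeping the unit in its own column matters here because minutes and seasons are not comparable quantities.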

In [None]:
#The number of records and columns in the dataset : 8807 rows and 12 columns
print(netflix_df.shape)

(8807, 12)


In [6]:
#basic information regarding the column datatypes
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
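The non-null counts above imply missing values: director, for example, has 6173 non-null entries out of 8807, so 2634 are missing. A small sketch of counting missing values directly is shown below; the miniature DataFrame is hypothetical and only stands in for `netflix_df`.

```python
import pandas as pd

# Hypothetical miniature of the dataset with some missing entries
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["India", None, "Japan", "Chile"],
})

missing_counts = sample.isna().sum()          # missing values per column
missing_share = sample.isna().mean() * 100    # percentage missing per column

print(missing_counts)
print(missing_share.round(1))
```

Applied to `netflix_df`, the same two lines would list the missing count and share for each of the 12 columns.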


In [7]:
# Summary statistics for the numeric columns. Only release_year is numeric in this dataset.
netflix_df.describe()
# The min and max show that this dataset covers titles released from 1925 to 2021.

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0
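Because describe() covers only numeric columns by default, the text columns are left out of the summary above. As a sketch under the assumption that a quick look at categorical columns is also wanted, `include="all"` adds count, unique, top, and freq rows for non-numeric columns, and `value_counts()` gives the full frequency table for one column. The sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample with one text column and one numeric column
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [2020, 2021, 2009, 2015],
})

# Summarize every column, not just the numeric ones
summary = sample.describe(include="all")
print(summary)

# Frequency of each category in a single column
type_counts = sample["type"].value_counts()
print(type_counts)
```

Statistics that do not apply to a column's type (such as the mean of a text column) appear as NaN in the combined summary.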


## 4. Train-Test Split

The train-test split procedure is used to estimate the performance of machine learning algorithms when they make predictions on data not used to train the model.

The procedure is fast and easy to perform, and its results let you compare the performance of machine learning algorithms on your predictive modeling problem. It is not always appropriate, however: with a small dataset the estimate can be unreliable, and some situations need additional configuration, such as classification on a dataset that is not balanced.

The train-test split procedure is appropriate when you have a very large dataset, a costly model to train, or a need for a good estimate of model performance quickly.

We will use the scikit-learn machine learning library to perform the train-test split procedure.
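One example of the additional configuration mentioned above is stratification for imbalanced classification. The sketch below uses made-up labels (90 of one class, 10 of another) rather than the Netflix data, and passes `stratify=y` so that each split keeps approximately the original class ratio. It illustrates the option and is not part of this notebook's workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)   # placeholder feature column

# stratify=y asks for the class proportions to be preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21, stratify=y
)

print(np.bincount(y_train))   # class counts in the training set
print(np.bincount(y_test))    # class counts in the test set
```

Without stratification, a random split of such data could leave very few minority-class samples in the test set, which would make the performance estimate unreliable for that class.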



In [10]:
# Independent variables (features): every column except release_year
X = netflix_df.drop(['release_year'], axis=1).values

# Dependent variable (target)
y = netflix_df.release_year.values

# Split the observations into an 80% training set and a 20% test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

In [13]:
X_train.shape

(7045, 11)

In [14]:
X_test.shape

(1762, 11)

In [15]:
y_train.shape

(7045,)

In [16]:
y_test.shape

(1762,)
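The shapes above can be checked by hand. The arithmetic sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to the training set, which is consistent with the 7045 and 1762 rows reported here.

```python
import math

n_samples = 8807      # rows in netflix_df, from the shape shown earlier
test_size = 0.2       # fraction passed to train_test_split

# 0.2 * 8807 = 1761.4, which is not a whole number of rows
n_test = math.ceil(test_size * n_samples)   # assumed to round up
n_train = n_samples - n_test                # the remainder is used for training

print(n_train, n_test)
```

The feature arrays have 11 columns because release_year was dropped from the original 12 to serve as the target.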

In [18]:
y_test

array([2019, 2019, 2012, ..., 2013, 2009, 2018])