<a href="https://colab.research.google.com/drive/1tlpM-sqKa2plNeOaKiuUsaNVQbTTKyBR?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial needs data, so if you are working in Colab, follow the data setup instructions below.

# Data Setup Instructions

These instructions explain how to mount your data from Google Drive in Colab and access it from the notebook.

STEP 1 - After opening the tutorial in Colab, click the folder button and then click the option to mount Google Drive.

STEP 2 - The drive folder will be mounted under the current directory, /content. You can access it as shown below.

In [1]:
# print current directory
%pwd

'/content'

In [2]:
%ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


STEP 3 - Find the folder where you saved the data and symlink it into the /content folder to simplify data access.

In this example, the Data folder is located at the following path in Google Drive (use your own data path instead):

/content/drive/Othercomputers/My MacBook Pro/Data/

We can symlink it into the /content folder using the following command:

In [3]:
# symlink the original Data folder into /content
!ln -s "/content/drive/Othercomputers/My MacBook Pro/Data" "/content"

Now we can access the data through /content/Data by giving the rest of the file path after Data/.
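To see what the symlink step does without touching your real Drive, here is a minimal sketch that reproduces it in a temporary directory. The folder layout and the file name `example.csv` are hypothetical stand-ins, and `os.symlink` is used as the Python counterpart of `ln -s`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the Data folder inside the mounted drive.
    real_data = os.path.join(tmp, "drive", "Othercomputers", "My MacBook Pro", "Data")
    os.makedirs(real_data)
    with open(os.path.join(real_data, "example.csv"), "w") as f:
        f.write("a,b\n1,2\n")

    # Hypothetical stand-in for the /content folder.
    content = os.path.join(tmp, "content")
    os.makedirs(content)

    # Same effect as: ln -s "<real_data>" "<content>"
    # A link named Data is created inside the content folder.
    link = os.path.join(content, "Data")
    os.symlink(real_data, link)

    link_is_symlink = os.path.islink(link)
    files_via_link = os.listdir(link)
    print(link_is_symlink, files_via_link)
```

The files stay where they are in the original folder; the link only provides a shorter path to them.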

# Importing pandas library and data loading

In [4]:
import pandas as pd

In the last lesson we saved the cleaned dataframe in the file

'movies_cleaned_lesson2.csv'.

You can read this file as shown below (the path is taken from the previous lesson).

In [5]:
# if you are working through this tutorial on a local machine, use the path where the data is saved on your computer
movies = pd.read_csv("Data/IMDB_rotten_tomato_dataset/IMDB/cleaned_files/movies_cleaned_lesson2.csv")
# use .head() to quickly inspect the first 5 rows of the dataset
movies.head()

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age
0,tt0000009,Miss Jerry,1894,1894-10-09,Romance,45,USA,,5.9,154,,,,,127
1,tt0000574,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,6.1,589,$ 2250,,,,115
2,tt0001892,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,5.8,188,,,,,110
3,tt0002101,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,5.2,446,$ 45000,,,,109
4,tt0002130,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,7.0,2237,,,,,110


### Handling missing values

If you look at the worldwide_gross_income column, you will see many missing values (missing values are denoted by NaN).

In [6]:
movies['worldwide_gross_income']

0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
           ...    
85849    $ 3507171
85850    $ 7299062
85851          NaN
85852       $ 2833
85853      $ 59794
Name: worldwide_gross_income, Length: 85854, dtype: object

One of the basic data cleaning steps for any dataset is to find the missing values. If the missing values cannot be filled in, the rows that contain them should be removed.
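Before deciding what to remove, it helps to count how many values are missing. The sketch below uses a small made-up dataframe (the titles and values are hypothetical, only the column names follow the movies data) and the pandas `isna()` method.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with the same column names as the movies dataframe.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "worldwide_gross_income": [np.nan, "$ 8811", np.nan, "$ 4272"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Fraction of rows with a missing worldwide_gross_income.
missing_share = toy["worldwide_gross_income"].isna().mean()

print(missing_per_column)
print(missing_share)
```

Running the same two calls on `movies` would show which columns are affected most.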

In our case, we will create a separate dataset with no missing values in the worldwide_gross_income column. This will let us work with the income data of movies separately.

We add .copy() at the end of the command so that it creates a completely new dataframe rather than a view that refers back to the original dataframe.

This helps us avoid the 'SettingWithCopyWarning' that may appear when we start modifying this new dataframe later on. (This is a very common warning in the pandas library; we will learn more about it during the bigger course of the OneLearn GetHired Program.)

In [7]:
# .notnull() is used to keep only the movies that have a worldwide_gross_income value in the data
movies_cleaned_gross_income = movies.loc[movies['worldwide_gross_income'].notnull()].copy()
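The effect of `.copy()` can be checked on a small example. This is a sketch on made-up data (the titles and the `has_income` column are hypothetical): a column added to the filtered copy does not appear in the original dataframe.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) standing in for the movies dataframe.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "worldwide_gross_income": [np.nan, "$ 8811", "$ 4272"],
})

# Same pattern as above: filter with .notnull(), then take an independent copy.
toy_income = toy.loc[toy["worldwide_gross_income"].notnull()].copy()

# Modify the copy only; "has_income" is a made-up column for illustration.
toy_income["has_income"] = True

original_columns = list(toy.columns)
copy_columns = list(toy_income.columns)
kept_index = list(toy_income.index)
print(original_columns, copy_columns, kept_index)
```

Note that the filtered dataframe keeps the original row labels, which is why the output above starts at index 79 rather than 0.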

In [8]:
movies_cleaned_gross_income

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age
79,tt0007183,Pikovaya dama,1916,1916-04-01,"Drama, Fantasy, Horror",63,Russia,Russian,7.0,610,,,$ 144968,,105
165,tt0010323,Das Cabinet des Dr. Caligari,1920,1920-02-27,"Fantasy, Horror, Mystery",76,Germany,German,8.1,55601,$ 18000,$ 8811,$ 8811,,101
210,tt0011440,Markens grøde,1921,1921-12-02,Drama,107,Norway,,6.6,195,NOK 250000,,$ 4272,,100
222,tt0011741,Suds,1920,1920-01-27,"Comedy, Drama, Romance",75,USA,English,6.3,210,,,$ 772155,,101
245,tt0012190,The Four Horsemen of the Apocalypse,1921,1923-04-16,"Drama, Romance, War",150,USA,,7.2,3058,$ 800000,$ 9183673,$ 9183673,,100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85846,tt9905412,Ottam,2019,2019-03-08,Drama,120,India,Malayalam,7.4,494,INR 4000000,,$ 4791,,2
85849,tt9908390,Le lion,2020,2020-01-29,Comedy,95,"France, Belgium",French,5.3,398,,,$ 3507171,,1
85850,tt9911196,De Beentjes van Sint-Hildegard,2020,2020-02-13,"Comedy, Drama",103,Netherlands,"German, Dutch",7.7,724,,,$ 7299062,,1
85852,tt9914286,Sokagin Çocuklari,2019,2019-03-15,"Drama, Family",98,Turkey,Turkish,6.4,194,,,$ 2833,,2


You can see that this dataframe has 31016 rows, far fewer than the 85854 rows of the initial dataframe. But since the bigger project analyses the earning potential of movies, we have to remove the rows with a missing worldwide_gross_income.
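The row counts before and after filtering can be compared with `len()`. The sketch below does this on made-up data and also shows `dropna(subset=...)`, a pandas alternative that should give the same rows as the `.notnull()` filter; it is offered as an equivalent option, not as the approach this lesson uses.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 5 rows, 3 of them without income.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "worldwide_gross_income": [np.nan, "$ 8811", np.nan, "$ 4272", np.nan],
})

# The filter used in this lesson.
filtered = toy.loc[toy["worldwide_gross_income"].notnull()].copy()

# Alternative: drop rows where this one column is missing.
dropped = toy.dropna(subset=["worldwide_gross_income"])

rows_before = len(toy)
rows_after = len(filtered)
same_result = filtered.equals(dropped)
print(rows_before, rows_after, same_result)
```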

# Saving the cleaned data

In this lesson we have not modified the initial dataframe movies, so we will not save it. But we have created a new dataframe, 'movies_cleaned_gross_income'.

We will save the dataframe 'movies_cleaned_gross_income' in a new file called 

'movies_cleaned_lesson3.csv'

in the folder cleaned_files.

The final path of the saved file will be 'Data/IMDB_rotten_tomato_dataset/IMDB/cleaned_files/movies_cleaned_lesson3.csv'

In [9]:
# don't forget to pass index=False when saving the dataframe to a CSV file
movies_cleaned_gross_income.to_csv('Data/IMDB_rotten_tomato_dataset/IMDB/cleaned_files/movies_cleaned_lesson3.csv', index = False)
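To see why `index=False` matters, the sketch below saves a small made-up dataframe both ways into a temporary folder and reads each file back. The file names are hypothetical. Without `index=False`, the row labels are written as an extra unnamed column, which `read_csv` loads as `Unnamed: 0`; a column like the one in the `movies.head()` output typically arises this way.

```python
import os
import tempfile

import pandas as pd

# Toy data (hypothetical) with non-default row labels, as left by filtering.
toy = pd.DataFrame(
    {"original_title": ["A", "B"], "year": [1920, 1921]},
    index=[165, 210],
)

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                  # row labels are written too
    toy.to_csv(without_index, index=False)  # only the data columns are written

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```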

We will use this file in the next lesson of Data Cleaning.