# Data Cleaning of Datasets
This notebook documents the data cleaning process for the datasets. In total there are __ (number of datasets) datasets.

### ***File: movies.csv***

In [38]:
#Import libraries needed for data cleaning.
import pandas as pd
import numpy as np

In [43]:
#movie1 is movies.csv
movie1 = pd.read_csv('../Database/movies.csv', sep = ';', encoding='latin-1')
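
The `sep` and `encoding` arguments matter here because the file is semicolon-separated and not UTF-8. As a minimal sketch of what they do, the snippet below reads a tiny in-memory file instead of `movies.csv`; the sample row (`Amélie`, `UGC`, `2001`) is made up for illustration and is not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-line file, encoded as latin-1 and separated by semicolons.
raw = 'name;company;year\nAmélie;UGC;2001\n'.encode('latin-1')

# Same arguments as the read_csv call above, applied to the in-memory bytes.
sample = pd.read_csv(io.BytesIO(raw), sep=';', encoding='latin-1')

print(sample.shape)
print(sample.loc[0, 'name'])
```

With the default `sep=','` the whole header would be read as a single column, and with the default UTF-8 encoding the accented character would typically raise a decoding error.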

In [44]:
#Preview the first rows of the data.
movie1.head()

| | budget | company | country | director | genre | gross | name | rating | released | runtime | score | star | votes | writer | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8000000.0 | Columbia Pictures Corporation | USA | Rob Reiner | Adventure | 52287414.0 | Stand by Me | R | 22/8/86 | 89 | 8.1 | Wil Wheaton | 299174 | Stephen King | 1986 |
| 1 | 6000000.0 | Paramount Pictures | USA | John Hughes | Comedy | 70136369.0 | Ferris Bueller's Day Off | PG-13 | 11/6/86 | 103 | 7.8 | Matthew Broderick | 264740 | John Hughes | 1986 |
| 2 | 15000000.0 | Paramount Pictures | USA | Tony Scott | Action | 179800601.0 | Top Gun | PG | 16/5/86 | 110 | 6.9 | Tom Cruise | 236909 | Jim Cash | 1986 |
| 3 | 18500000.0 | Twentieth Century Fox Film Corporation | USA | James Cameron | Action | 85160248.0 | Aliens | R | 18/7/86 | 137 | 8.4 | Sigourney Weaver | 540152 | James Cameron | 1986 |
| 4 | 9000000.0 | Walt Disney Pictures | USA | Randal Kleiser | Adventure | 18564613.0 | Flight of the Navigator | PG | 1/8/86 | 90 | 6.9 | Joey Cramer | 36636 | Mark H. Baker | 1986 |


### Let's start with Data Cleaning...

#### 1st step: Check data types, correct them, check for nulls and change column names, if needed.

In [45]:
movie1.dtypes #Checking data types.

budget      float64
company      object
country      object
director     object
genre        object
gross       float64
name         object
rating       object
released     object
runtime       int64
score       float64
star         object
votes         int64
writer       object
year          int64
dtype: object

The data types are suitable for the columns used in this analysis, so no conversion is needed.
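
The one column where `object` hides a more specific type is `released`, whose values look like day/month/two-digit-year strings (for example `22/8/86`). It is dropped in the next step, so nothing is done with it here. If it were kept, a conversion along these lines could be used. The format string is an assumption based only on the five rows shown above.

```python
import pandas as pd

# Sample values copied from the head() output above.
released = pd.Series(['22/8/86', '11/6/86', '1/8/86'])

# Assumed format: day/month/two-digit year. errors='coerce' turns any
# value that does not match the format into NaT instead of raising.
released_dt = pd.to_datetime(released, format='%d/%m/%y', errors='coerce')

print(released_dt.dt.year.tolist())
print(int(released_dt.isna().sum()))  # number of values that failed to parse
```

Counting the `NaT` values afterwards is a quick way to see whether the assumed format actually fits the whole column.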

In [46]:
movie1.isnull().sum().sum() #Check overall amount of null values

0

The dataset contains no null values.
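
Note that `isnull()` only counts `NaN`/`None`. Some rows near the end of the dataset, displayed later in this notebook, have a `budget` of `0.0`. These may be placeholders for an unknown budget rather than real values; that is an assumption, not something the dataset states. A minimal sketch of how such values could be counted, using three rows copied from this notebook:

```python
import pandas as pd

# Small sample built from rows displayed in this notebook.
sample = pd.DataFrame({
    'name': ['Stand by Me', 'Mothers and Daughters', 'The Eyes of My Mother'],
    'budget': [8000000.0, 0.0, 0.0],
    'gross': [52287414.0, 28368.0, 25981.0],
})

null_count = int(sample.isnull().sum().sum())      # counts NaN/None only
zero_budget = int((sample['budget'] == 0).sum())   # possible placeholder values

print(null_count, zero_budget)
```

On the full data, the same comparison on `movie1['budget']` would show how many rows are affected before deciding whether to keep, drop, or flag them.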

In [47]:
movie1.columns #Checking column names

Index(['budget', 'company', 'country', 'director', 'genre', 'gross', 'name',
       'rating', 'released', 'runtime', 'score', 'star', 'votes', 'writer',
       'year'],
      dtype='object')

The column names are correct and descriptive, so there is no need to change them.

#### 2nd step: Drop columns that are not going to be used

The following columns will not be used in the analysis: `runtime`, `star`, `writer`, `country`, `director`, `released` and `rating`.

In [48]:
movie1.drop(columns = ['country', 'director', 'rating', 'released', 'runtime', 'star', 'writer'], inplace = True)
#Columns deleted
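
Because the drop is done in place, running that cell a second time raises a `KeyError`: the columns are already gone. The sketch below reproduces this on a small hypothetical frame and shows the `errors='ignore'` option of `DataFrame.drop`, which skips labels that are not present. Whether silently ignoring missing columns is desirable depends on the workflow.

```python
import pandas as pd

# Hypothetical one-row frame with two of the columns to be dropped.
df = pd.DataFrame({'name': ['Aliens'], 'runtime': [137], 'star': ['Sigourney Weaver']})
unused = ['runtime', 'star']

df = df.drop(columns=unused)

# Second attempt: the columns no longer exist, so drop() raises KeyError.
try:
    df.drop(columns=unused)
    rerun_failed = False
except KeyError:
    rerun_failed = True

# errors='ignore' makes the call safe to repeat.
df = df.drop(columns=unused, errors='ignore')

print(rerun_failed, list(df.columns))
```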

In [49]:
movie1.head()

| | budget | company | genre | gross | name | score | votes | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 8000000.0 | Columbia Pictures Corporation | Adventure | 52287414.0 | Stand by Me | 8.1 | 299174 | 1986 |
| 1 | 6000000.0 | Paramount Pictures | Comedy | 70136369.0 | Ferris Bueller's Day Off | 7.8 | 264740 | 1986 |
| 2 | 15000000.0 | Paramount Pictures | Action | 179800601.0 | Top Gun | 6.9 | 236909 | 1986 |
| 3 | 18500000.0 | Twentieth Century Fox Film Corporation | Action | 85160248.0 | Aliens | 8.4 | 540152 | 1986 |
| 4 | 9000000.0 | Walt Disney Pictures | Adventure | 18564613.0 | Flight of the Navigator | 6.9 | 36636 | 1986 |


Reorder the columns

In [50]:
movie1 = movie1[['name', 'company', 'year', 'genre', 'votes', 'score', 'budget', 'gross']] #Assign the result so the new order is kept.
movie1

| | name | company | year | genre | votes | score | budget | gross |
|---|---|---|---|---|---|---|---|---|
| 0 | Stand by Me | Columbia Pictures Corporation | 1986 | Adventure | 299174 | 8.1 | 8000000.0 | 52287414.0 |
| 1 | Ferris Bueller's Day Off | Paramount Pictures | 1986 | Comedy | 264740 | 7.8 | 6000000.0 | 70136369.0 |
| 2 | Top Gun | Paramount Pictures | 1986 | Action | 236909 | 6.9 | 15000000.0 | 179800601.0 |
| 3 | Aliens | Twentieth Century Fox Film Corporation | 1986 | Action | 540152 | 8.4 | 18500000.0 | 85160248.0 |
| 4 | Flight of the Navigator | Walt Disney Pictures | 1986 | Adventure | 36636 | 6.9 | 9000000.0 | 18564613.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6802 | Absolutely Fabulous: The Movie | Fox Searchlight Pictures | 2016 | Comedy | 9161 | 5.4 | 0.0 | 4750497.0 |
| 6803 | Mothers and Daughters | Siempre Viva Productions | 2016 | Drama | 1959 | 4.9 | 0.0 | 28368.0 |
| 6804 | Batman: The Killing Joke | Warner Bros. Animation | 2016 | Animation | 36333 | 6.5 | 3500000.0 | 3775000.0 |
| 6805 | The Eyes of My Mother | Borderline Presents | 2016 | Drama | 6947 | 6.2 | 0.0 | 25981.0 |
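
Selecting columns with a list returns a new DataFrame and leaves the original untouched, so the new order only persists if the result is assigned back to a variable. A minimal sketch on a small frame (the three-column subset is chosen only to keep the example short):

```python
import pandas as pd

df = pd.DataFrame({'budget': [8000000.0], 'name': ['Stand by Me'], 'year': [1986]})
order = ['name', 'year', 'budget']

df[order]                       # returns a reordered copy; df itself is unchanged
unchanged = list(df.columns)

df = df[order]                  # assignment makes the new order stick
reordered = list(df.columns)

print(unchanged)
print(reordered)
```

Without the assignment, a later `to_csv` call would write the columns in their original order.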


In [51]:
movie1.to_csv('../Database/Clean/movie_clean.csv') #Saving in a csv file to proceed to the analysis part.
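
By default `to_csv` also writes the row index as an extra, unnamed first column, which shows up as `Unnamed: 0` when the file is read back. It also writes a comma-separated file, so the cleaned file does not need the `sep=';'` used for the raw one. If the index carries no information, passing `index=False` avoids the extra column. The sketch below compares both options using in-memory buffers instead of real files:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Top Gun'], 'year': [1986]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Whether to drop the index is a design choice: if the analysis notebook already expects the extra column, or reads the file with `index_col=0`, the default is fine.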