
In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('MovieAssignmentData.csv')
df.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [3]:
print("Number of rows and columns :",df.shape)

Number of rows and columns : (5043, 28)


# Checking null values in the dataset

In [4]:
df.isnull().sum()

color                         19
director_name                104
num_critic_for_reviews        50
duration                      15
director_facebook_likes      104
actor_3_facebook_likes        23
actor_2_name                  13
actor_1_facebook_likes         7
gross                        884
genres                         0
actor_1_name                   7
movie_title                    0
num_voted_users                0
cast_total_facebook_likes      0
actor_3_name                  23
facenumber_in_poster          13
plot_keywords                153
movie_imdb_link                0
num_user_for_reviews          21
language                      12
country                        5
content_rating               303
budget                       492
title_year                   108
actor_2_facebook_likes        13
imdb_score                     0
aspect_ratio                 329
movie_facebook_likes           0
dtype: int64

## Drop the columns with the most null values

If you're in a hurry or don't need to figure out why values are missing, one option is to remove any rows or columns that contain missing values. (This is not generally recommended for important projects. It is usually worth taking the time to go through the columns with missing values one by one and get to know your dataset.)


In [5]:
cols_to_drop = ['gross', 'budget', 'aspect_ratio', 'director_facebook_likes',
                'title_year', 'content_rating', 'plot_keywords']
df.drop(columns=cols_to_drop, inplace=True)
df.isnull().sum()

color                         19
director_name                104
num_critic_for_reviews        50
duration                      15
actor_3_facebook_likes        23
actor_2_name                  13
actor_1_facebook_likes         7
genres                         0
actor_1_name                   7
movie_title                    0
num_voted_users                0
cast_total_facebook_likes      0
actor_3_name                  23
facenumber_in_poster          13
movie_imdb_link                0
num_user_for_reviews          21
language                      12
country                        5
actor_2_facebook_likes        13
imdb_score                     0
movie_facebook_likes           0
dtype: int64
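
The columns above were picked by name. As a rough sketch of the same idea driven by a cutoff instead, the snippet below drops columns whose share of missing values exceeds a threshold, then drops any rows that are still incomplete. The toy frame and the `max_null_fraction` value of 0.5 are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movie data
toy = pd.DataFrame({
    "gross": [1.0, np.nan, np.nan, np.nan],
    "duration": [178.0, 169.0, np.nan, 164.0],
    "imdb_score": [7.9, 7.1, 6.8, 8.5],
})

# Fraction of missing values in each column
null_fraction = toy.isnull().mean()

# Keep only the columns at or below the cutoff (0.5 is an assumed value)
max_null_fraction = 0.5
kept = toy.loc[:, null_fraction <= max_null_fraction]

# Optionally drop every row that still contains a missing value
complete_rows = kept.dropna(axis=0)

print(list(kept.columns))
print(len(complete_rows))
```

Dropping rows after dropping the sparsest columns usually loses fewer rows than calling `dropna` on the full frame.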

# Imputation
Imputation means filling in the missing values.

The benefit of imputation is that it prevents data loss, especially when the data set is small.

Missing values can be filled with the column's *mean*, *median*, or *mode*.

The imputed value won't be exactly right in most cases, but it usually gives more accurate models than dropping the column entirely.
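
To make the three options concrete, here is a minimal sketch on a made-up series (the values are hypothetical, not from the movie data). It includes one large outlier to show why the choice of statistic matters.

```python
import numpy as np
import pandas as pd

# Hypothetical durations with one missing value and one outlier (510)
durations = pd.Series([90.0, 100.0, 100.0, np.nan, 510.0])

# Each statistic is computed from the observed values only (NaN is skipped)
filled_mean = durations.fillna(durations.mean())
filled_median = durations.fillna(durations.median())
# mode() returns a Series because ties are possible; take the first entry
filled_mode = durations.fillna(durations.mode().iloc[0])

print(filled_mean[3], filled_median[3], filled_mode[3])
```

Here the mean is pulled up to 200 by the outlier, while the median and mode both give 100, so the median is often the safer choice for skewed columns.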

### Imputing using the average (the mean) of the numeric data set

In [6]:
# Select the numeric columns to impute
numeric_df = df.select_dtypes(include=[np.number])
numeric_df.isnull().sum()

num_critic_for_reviews       50
duration                     15
actor_3_facebook_likes       23
actor_1_facebook_likes        7
num_voted_users               0
cast_total_facebook_likes     0
facenumber_in_poster         13
num_user_for_reviews         21
actor_2_facebook_likes       13
imdb_score                    0
movie_facebook_likes          0
dtype: int64

In [7]:
for i in numeric_df.columns:
    print("Null values before imputing =", df[i].isnull().sum())

Null values before imputing = 50
Null values before imputing = 15
Null values before imputing = 23
Null values before imputing = 7
Null values before imputing = 0
Null values before imputing = 0
Null values before imputing = 13
Null values before imputing = 21
Null values before imputing = 13
Null values before imputing = 0
Null values before imputing = 0


In [8]:
for i in numeric_df.columns:  # iterate over each numeric column
    if numeric_df[i].isnull().sum() > 0:
        # replace the nulls with the column mean and store the result in the original data set
        df[i] = df[i].fillna(numeric_df[i].mean())
        print("Null values after imputing =", df[i].isnull().sum())

Null values after imputing = 0
Null values after imputing = 0
Null values after imputing = 0
Null values after imputing = 0
Null values after imputing = 0
Null values after imputing = 0
Null values after imputing = 0
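
The column-by-column loop can also be written as a single call, since `fillna` accepts a Series of per-column fill values. This is a sketch on a hypothetical toy frame rather than on `df` itself; the column names are borrowed from the data set only for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two numeric columns and one text column
toy = pd.DataFrame({
    "duration": [100.0, np.nan, 140.0],
    "num_user_for_reviews": [10.0, 30.0, np.nan],
    "language": ["English", None, "English"],
})

# Select the numeric columns, then fill each one with its own mean
numeric_cols = toy.select_dtypes(include=[np.number]).columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(toy.isnull().sum())
```

The text column is left untouched, so its missing value is still there for a categorical strategy to handle.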


## Imputing using sklearn



### Using IterativeImputer
IterativeImputer treats each column with missing values as a target: it fits a model on the other columns and uses that model to predict the missing entries.

[See the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html)


In [9]:
# explicitly require this experimental feature
from sklearn.experimental import enable_iterative_imputer
# now you can import normally from sklearn.impute
from sklearn.impute import IterativeImputer

In [10]:
impute_iterativaly = IterativeImputer()

In [11]:
impute_iterativaly.fit_transform(numeric_df)

array([[7.23e+02, 1.78e+02, 8.55e+02, ..., 9.36e+02, 7.90e+00, 3.30e+04],
       [3.02e+02, 1.69e+02, 1.00e+03, ..., 5.00e+03, 7.10e+00, 0.00e+00],
       [6.02e+02, 1.48e+02, 1.61e+02, ..., 3.93e+02, 6.80e+00, 8.50e+04],
       ...,
       [1.30e+01, 7.60e+01, 0.00e+00, ..., 0.00e+00, 6.30e+00, 1.60e+01],
       [1.40e+01, 1.00e+02, 4.89e+02, ..., 7.19e+02, 6.30e+00, 6.60e+02],
       [4.30e+01, 9.00e+01, 1.60e+01, ..., 2.30e+01, 6.60e+00, 4.56e+02]])
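
On the full data set it is hard to see what the imputer did. The sketch below uses a tiny made-up array in which the second column is exactly twice the first, so the prediction for the missing entry is easy to check by eye. The data and `random_state=0` are assumptions for illustration.

```python
import numpy as np
# IterativeImputer is experimental, so the enabling import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: the second column is twice the first
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],  # entry to impute; the pattern suggests a value near 8
    [5.0, 10.0],
])

imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)

# The imputed entry should land close to 8, following the linear pattern
print(X_filled[3, 1])
```

A mean fill would have put 5.5 in that cell, which ignores the relationship between the two columns.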

### Imputing using SimpleImputer

In [12]:
from sklearn.impute import SimpleImputer 

In [26]:
# Impute the categorical data using the mode (most_frequent)
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp.fit_transform(df.select_dtypes(exclude=[np.number]))

array([['Color', 'James Cameron', 'Joel David Moore', ...,
        'http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1',
        'English', 'USA'],
       ['Color', 'Gore Verbinski', 'Orlando Bloom', ...,
        'http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1',
        'English', 'USA'],
       ['Color', 'Sam Mendes', 'Rory Kinnear', ...,
        'http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1',
        'English', 'UK'],
       ...,
       ['Color', 'Benjamin Roberds', 'Maxwell Moody', ...,
        'http://www.imdb.com/title/tt2107644/?ref_=fn_tt_tt_1',
        'English', 'USA'],
       ['Color', 'Daniel Hsia', 'Daniel Henney', ...,
        'http://www.imdb.com/title/tt2070597/?ref_=fn_tt_tt_1',
        'English', 'USA'],
       ['Color', 'Jon Gunn', 'Brian Herzlinger', ...,
        'http://www.imdb.com/title/tt0378407/?ref_=fn_tt_tt_1',
        'English', 'USA']], dtype=object)
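
`fit_transform` returns a plain NumPy array, so the column names are lost. One way to keep them, sketched here on a hypothetical two-column frame, is to wrap the result back into a DataFrame with the original columns and index.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical frame; dtype=object keeps np.nan as the missing marker
cats = pd.DataFrame({
    "color": ["Color", "Color", np.nan, "Black and White"],
    "language": ["English", np.nan, "English", "French"],
}, dtype=object)

imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled_array = imp.fit_transform(cats)

# Restore the labels that the NumPy array dropped
cats_filled = pd.DataFrame(filled_array, columns=cats.columns, index=cats.index)

print(cats_filled)
```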

### Imputing using KNN Imputer

In [22]:
from sklearn.impute import KNNImputer

In [25]:
knn_impu = KNNImputer(n_neighbors=2)
knn_impu.fit_transform(numeric_df)

array([[7.23e+02, 1.78e+02, 8.55e+02, ..., 9.36e+02, 7.90e+00, 3.30e+04],
       [3.02e+02, 1.69e+02, 1.00e+03, ..., 5.00e+03, 7.10e+00, 0.00e+00],
       [6.02e+02, 1.48e+02, 1.61e+02, ..., 3.93e+02, 6.80e+00, 8.50e+04],
       ...,
       [1.30e+01, 7.60e+01, 0.00e+00, ..., 0.00e+00, 6.30e+00, 1.60e+01],
       [1.40e+01, 1.00e+02, 4.89e+02, ..., 7.19e+02, 6.30e+00, 6.60e+02],
       [4.30e+01, 9.00e+01, 1.60e+01, ..., 2.30e+01, 6.60e+00, 4.56e+02]])
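
To see what `n_neighbors=2` does, here is a minimal sketch on a made-up array. For the row with a missing value, KNNImputer finds the two closest rows (measured on the columns that are present) and averages their values for the missing column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data; the third row is missing its second value
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

knn = KNNImputer(n_neighbors=2)
X_filled = knn.fit_transform(X)

# The two nearest rows by the first column are [2, 20] and [1, 10],
# so the imputed value is their average
print(X_filled[2, 1])
```

Because distances depend on the scale of each column, scaling the features before using KNNImputer is often worth considering on real data.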