# **Questions/Answers**
### 1. **How do you deal with missing data? Explain all the possible situations.**

**Ans.1** <br/>
Dealing with missing data is a critical step in data analysis and modeling. Missing data can occur for various reasons, and how you handle it depends on the nature of the data and the specific context of your analysis. Here are some common strategies for dealing with missing data, along with explanations for different situations:

a) **Ignoring Missing Data**:
   - **Situation**: In some cases, missing data may be minimal or occur randomly, and it may not significantly impact the analysis or modeling. In such situations, you can choose to ignore missing values.
   - **Explanation**: Ignoring missing data is a valid approach if the missingness is assumed to be completely random and not related to the variables being studied. However, this approach may lead to a loss of information and potentially biased results if missing data is not random.

b) **Listwise Deletion**:
   - **Situation**: When dealing with a small amount of missing data and it's reasonable to remove entire rows (samples) with missing values.
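   - **Explanation**: Listwise deletion drops every row that contains at least one missing value. It is simple, but it shrinks the sample and can make the remaining data less representative, especially if the rows with missing values differ systematically from the complete ones.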

c) **Imputation**:
   - **Situation**: Missing data is common, and you want to retain as much data as possible while providing estimates for the missing values.
   - **Explanation**: Imputation involves replacing missing values with estimated values based on the available data. Common imputation methods include mean imputation (replacing with the mean of the variable), median imputation, mode imputation, or more advanced techniques like regression imputation or k-nearest neighbors imputation.

d) **Forward Fill or Backward Fill (Time Series Data)**:
   - **Situation**: In time series data, missing values can often be filled with the most recent non-missing value (forward fill) or the next non-missing value (backward fill).
   - **Explanation**: These methods make sense when missing data points are expected to be similar to nearby observations in a time series.

e) **Interpolation**:
   - **Situation**: For continuous data, you can use interpolation methods such as linear interpolation, spline interpolation, or time-based interpolation to estimate missing values.
   - **Explanation**: Interpolation assumes that the missing values follow a pattern or trend in the data and fills them in a way that fits this pattern.

f) **Multiple Imputation**:
   - **Situation**: When the missingness depends on other observed variables (missing at random, rather than missing completely at random) and you want to account for the uncertainty in the imputed values.
   - **Explanation**: Multiple imputation creates several copies of the dataset, each with different plausible imputed values, and analyzes each copy separately. The results are then pooled, so the final estimates and standard errors reflect the uncertainty introduced by imputation.

g) **Domain-Specific Methods**:
   - **Situation**: In some cases, domain-specific knowledge can help in handling missing data. For example, in medical research, certain missing values might be treated differently based on clinical expertise.
   - **Explanation**: Domain knowledge can guide imputation methods or inform the decision on whether to ignore or impute missing data.

h) **Data Collection Improvement**:
   - **Situation**: If missing data is a recurrent problem, consider improving data collection processes to reduce the occurrence of missing values.
   - **Explanation**: Prevention is often the best strategy. Better data collection techniques, data validation, and data entry procedures can minimize the amount of missing data in the first place.

Choosing the appropriate method for handling missing data depends on the specific characteristics of your dataset and the objectives of your analysis. It's crucial to carefully assess the reasons for missing data and consider the potential impact of each strategy on your results. No single approach is suitable for all situations, so a combination of methods or sensitivity analyses may be required to ensure robust and accurate results.
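
Several of the strategies above (deletion, simple imputation, forward/backward fill, interpolation) can be sketched in pandas as follows. This is a minimal illustration, not a recommended pipeline: the Series `s` and its values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with two gaps; the values are invented for illustration.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()                      # listwise deletion: keep only observed rows
mean_filled = s.fillna(s.mean())          # mean imputation (median/mode work the same way)
forward = s.ffill()                       # forward fill: carry the last observed value forward
backward = s.bfill()                      # backward fill: use the next observed value
linear = s.interpolate(method="linear")   # linear interpolation between neighbours

print(linear.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Comparing `mean_filled`, `forward`, and `linear` on the same gaps shows how much the filled values depend on the chosen strategy, which is why a sensitivity check across methods is often worthwhile.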

### 2. **Explain label encoding and one-hot encoding with an example.**

**Ans.2** <br/>
Label encoding and one-hot encoding are two techniques commonly used to represent categorical data in a format that machine learning algorithms can understand. Let's explore each of them with examples:

**Label Encoding:**

Label encoding converts categorical data into numerical form by assigning each unique category a unique integer. Because integers are ordered, a model may read the codes as a ranking, as if one category were greater or more significant than another. This can lead to problems when the categories have no natural order.

**Example:**
Consider a dataset of car sizes with three categories: Small, Medium, and Large. Using label encoding, you might represent these categories as:

- Small: 0
- Medium: 1
- Large: 2

Here the codes imply that "Large" is greater than "Medium," and "Medium" is greater than "Small," which matches the natural order of car sizes. For a variable with no such order (for example, car color), the same kind of integer codes would suggest a ranking that does not exist and could mislead machine learning algorithms.
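
To make the Small/Medium/Large mapping concrete, here is a minimal sketch using an ordered pandas categorical. The `sizes` values are invented for illustration, and the category order is passed explicitly as an assumption about what "order" means for this variable.

```python
import pandas as pd

# Hypothetical car-size column; the rows are invented for illustration.
sizes = pd.Series(["Medium", "Small", "Large", "Small"])

# Declare the order explicitly so the integer codes follow Small < Medium < Large.
ordered = pd.Categorical(sizes, categories=["Small", "Medium", "Large"], ordered=True)
codes = ordered.codes.tolist()

print(codes)  # [1, 0, 2, 0]
```

Stating the order explicitly matters: scikit-learn's `LabelEncoder` numbers labels in sorted order, which for these strings would give Large=0, Medium=1, Small=2 and lose the intended ranking.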

**One-Hot Encoding:**

One-hot encoding is a method of converting categorical data into a binary matrix (0s and 1s), where each category becomes a separate binary feature. Each category is represented as a binary vector, with a 1 in the column corresponding to the category and 0s in all other columns. This approach doesn't assume any ordinal relationship between categories and is especially useful when dealing with nominal categorical variables (categories with no inherent order).

**Example:**
Using the same car size dataset with one-hot encoding:

- Small: [1, 0, 0]
- Medium: [0, 1, 0]
- Large: [0, 0, 1]

Here, each category is represented as a binary vector, and the absence of a 1 in a column indicates that the corresponding category is not present. One-hot encoding prevents misinterpretation of ordinal relationships and ensures that the machine learning algorithm treats all categories equally.
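
The same vectors can be produced with `pandas.get_dummies`. This is a small sketch on an invented `size` column; the column name and rows are assumptions made for the example.

```python
import pandas as pd

# Hypothetical car-size column; the rows are invented for illustration.
sizes = pd.DataFrame({"size": ["Small", "Medium", "Large", "Small"]})

# dtype=int gives 0/1 columns; newer pandas versions default to booleans.
one_hot = pd.get_dummies(sizes, columns=["size"], dtype=int)

# get_dummies orders the new columns alphabetically, so reorder them to
# match the Small, Medium, Large order used in the text.
one_hot = one_hot[["size_Small", "size_Medium", "size_Large"]]

print(one_hot.values.tolist())
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```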

However, it's important to note that one-hot encoding can increase the dimensionality of your dataset significantly, which might not be suitable for datasets with a large number of categories. In such cases, techniques like feature selection or dimensionality reduction may be necessary to manage the high-dimensional data effectively.

In summary, label encoding is suitable for ordinal categorical data, while one-hot encoding is preferred for nominal categorical data, as it avoids introducing unintended ordinal relationships and provides a clear representation for machine learning algorithms. The choice between these encoding methods should be made based on the nature of your categorical data and the requirements of your machine learning task.

### 3. **What are the different kinds of normalization that are commonly used? Give the formula for normalizing the values.**

**Ans.3** <br/>
Normalization is a data preprocessing technique used to scale numeric features in a dataset to a standard range, typically between 0 and 1, or sometimes -1 and 1. Normalization helps to make the features more comparable and prevents some features from dominating others due to differences in scale. There are several common methods for normalization, each with its own formula. Here are three frequently used methods:

a) **Min-Max Scaling (Normalization):**

   Min-Max scaling scales the values of a feature to a specified range, usually [0, 1]. The formula for Min-Max scaling is as follows:

   For a feature X:
   
   X<sub>normalized</sub> = (X - X<sub>min</sub>) / (X<sub>max</sub> - X<sub>min</sub>)


   Where:
   - (X) is the original value of the feature.
   - (X<sub>min</sub>) is the minimum value of the feature in the dataset.
   - (X<sub>max</sub>) is the maximum value of the feature in the dataset.

   This formula ensures that the scaled values fall within the desired range [0, 1].

b) **Z-Score Standardization (Standardization):**

   Z-score standardization, also known as standardization, scales the values of a feature to have a mean (μ) of 0 and a standard deviation (σ) of 1. This is useful when you want to center the data around zero and express it in units of standard deviations. The formula for Z-score standardization is as follows:

   For a feature X:

   X<sub>standardized</sub> = (X - μ) / σ

   Where:
   - (X) is the original value of the feature.
   - (μ) is the mean (average) of the feature values in the dataset.
   - (σ) is the standard deviation of the feature values in the dataset.

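   Standardization gives the feature a mean of 0 and a standard deviation of 1. It only shifts and rescales the values, so the shape of the distribution is unchanged: the result follows a standard normal distribution only if the original feature was normally distributed.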

c) **Robust Scaling:**

   Robust scaling is a variation of scaling that is less affected by outliers than Min-Max scaling or Z-score standardization. It uses the median (m) and the interquartile range (IQR) to scale the data. The formula for robust scaling is as follows:

   For a feature X:

   X<sub>scaled</sub> = (X - m) / (IQR)

   Where:
   - (X) is the original value of the feature.
   - (m) is the median of the feature values in the dataset.
   - (IQR) is the interquartile range, calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the feature values.

   Robust scaling is particularly useful when dealing with data that contains outliers.

These normalization methods help in preparing data for various machine learning algorithms and statistical analyses by ensuring that features are on similar scales. The choice of which normalization method to use depends on the specific characteristics of your data and the requirements of your analysis or model.
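
The three formulas can be checked on a small array with NumPy. The values in `x` are invented for illustration and include one deliberate outlier. The z-score here uses the population standard deviation (NumPy's default, `ddof=0`), which is an assumption, since the text does not say which variant is meant.

```python
import numpy as np

# Hypothetical feature values with one large outlier (100).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-Max scaling: (X - X_min) / (X_max - X_min)
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: (X - mean) / std, with population std (ddof=0)
z_score = (x - x.mean()) / x.std()

# Robust scaling: (X - median) / IQR, where IQR = Q3 - Q1
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(robust.tolist())  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

With the outlier present, Min-Max scaling squeezes the four small values into roughly the bottom 3% of the [0, 1] range, while robust scaling keeps them spread between -1 and 0.5 and leaves the outlier clearly visible.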

### 4. **"Data cleaning, Scrubbing and Normalization became over 70% of a data scientist's job" - Explain why?**

**Ans.4** <br/>
The statement that "Data cleaning, scrubbing, and normalization became over 70% of a data scientist's job" is a rough, frequently quoted figure rather than a precise measurement, but it reflects how much practical data science work goes into preparing data. Several factors contribute to this:

a) **Data Quality Issues**: Real-world data is often messy and imperfect. It can contain missing values, outliers, inconsistencies, and errors. Before any meaningful analysis or modeling can occur, data scientists must invest a substantial amount of time and effort in cleaning and preparing the data. Poor data quality can lead to incorrect insights and inaccurate models, making data cleaning crucial.

b) **Data Volume**: With the advent of big data, the volume of data that organizations collect and store has grown exponentially. Handling large datasets exacerbates data quality issues and increases the time and resources needed for data cleaning and preprocessing.

c) **Data Variety**: Data comes in various formats, including structured, semi-structured, and unstructured data. Each type of data requires different cleaning and preprocessing techniques. Data scientists need to adapt their approaches to handle this variety effectively.

d) **Data Sources**: Data is often collected from multiple sources, such as databases, sensors, social media, web scraping, and more. Each source may have its own data format and quality challenges. Integrating and cleaning data from these disparate sources can be complex and time-consuming.

e) **Regulatory Compliance**: Many industries have strict regulations regarding data privacy and security, such as GDPR in Europe or HIPAA in healthcare. Data scientists must ensure that the data they work with complies with these regulations, which often involves additional data cleaning and anonymization.

f) **Data Preprocessing for Machine Learning**: Data scientists spend a significant portion of their time preparing data for machine learning models. This includes not only cleaning and scrubbing data but also transforming it into a format suitable for modeling. Features may need to be engineered, scaled, and normalized to improve model performance.

g) **Bias and Fairness**: Ensuring fairness and addressing bias in data is a critical concern in data science. Biases in data can lead to biased models, which can have ethical and legal implications. Data scientists need to identify and mitigate bias in their datasets, which involves careful examination and cleaning of data.

h) **Data Exploration and Understanding**: Before diving into modeling, data scientists must thoroughly explore and understand the data. This exploration often uncovers anomalies, patterns, and insights that necessitate data cleaning and preprocessing.

i) **Iterative Process**: Data science is often an iterative process. As data scientists build and refine models, they may discover issues with the data that require revisiting and refining the cleaning and preprocessing steps.

j) **Communication and Collaboration**: Data scientists must communicate their findings and insights effectively to non-technical stakeholders. Clean, well-prepared data is essential for conveying meaningful results and making data-driven decisions.

In summary, data cleaning, scrubbing, and normalization are critical and time-consuming steps in the data science pipeline. They ensure that the data is reliable, accurate, and suitable for analysis and modeling. Given the increasing importance of data-driven decision-making, the role of a data scientist includes a substantial focus on these tasks, often comprising a significant portion of their work.

# **Ques.** You have a project to predict the IMDB rating of a movie based on various features. Before modeling, you realize that the data is not clean, so you need to perform the following operations:

### a) Impute the columns containing more than 100 NaN values with a suitable measure of central tendency.
### b) Remove the rows with NaN values in the remaining columns.
### c) Label encode the columns 'Language', 'Country', 'Content', 'Rating'.
### d) One-hot encode the column 'Country'.

In [226]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [227]:
data = pd.read_csv("Dataset/movie_metadata.csv")

In [228]:
data

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5038,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
5039,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
5040,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
5041,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


In [229]:
nan_counts = data.isna().sum()
nan_counts

color                         19
director_name                104
num_critic_for_reviews        50
duration                      15
director_facebook_likes      104
actor_3_facebook_likes        23
actor_2_name                  13
actor_1_facebook_likes         7
gross                        884
genres                         0
actor_1_name                   7
movie_title                    0
num_voted_users                0
cast_total_facebook_likes      0
actor_3_name                  23
facenumber_in_poster          13
plot_keywords                153
movie_imdb_link                0
num_user_for_reviews          21
language                      12
country                        5
content_rating               303
budget                       492
title_year                   108
actor_2_facebook_likes        13
imdb_score                     0
aspect_ratio                 329
movie_facebook_likes           0
dtype: int64

In [230]:
columns_to_impute = nan_counts[nan_counts > 100].index
columns_to_impute

Index(['director_name', 'director_facebook_likes', 'gross', 'plot_keywords',
       'content_rating', 'budget', 'title_year', 'aspect_ratio'],
      dtype='object')

In [231]:
data[columns_to_impute].dtypes

director_name               object
director_facebook_likes    float64
gross                      float64
plot_keywords               object
content_rating              object
budget                     float64
title_year                 float64
aspect_ratio               float64
dtype: object

In [232]:
for column in columns_to_impute:
    if (data[column].dtype == 'object'):
        # Categorical (string) columns: fill with the most frequent value.
        mode_value = data[column].mode()[0]
        data[column] = data[column].fillna(mode_value)
    elif (data[column].dtype == 'float64'):
        # Numeric columns: fill with the column mean.
        mean_value = data[column].mean()
        data[column] = data[column].fillna(mean_value)
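
The loop above fills numeric columns using `.mean()`. Whether the mean or the median is the more "suitable" central tendency depends on how skewed a column is. The sketch below uses invented numbers (not values from `movie_metadata.csv`) to show how a single extreme value pulls the mean away from the bulk of the data while the median stays put.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values with one missing entry; invented for illustration.
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 1000.0, np.nan])

mean_value = toy.mean()      # pulled upward by the single extreme value
median_value = toy.median()  # middle of the observed values

# Fill the gap with the median instead of the mean.
filled_with_median = toy.fillna(median_value)

print(mean_value, median_value)  # 209.2 12.0
```

Monetary columns such as `gross` and `budget` may be skewed in this way; checking `data[column].skew()` before choosing a fill value would be one way to decide.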

In [233]:
data

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.000000,855.0,Joel David Moore,1000.0,7.605058e+08,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,2.370000e+08,2009.000000,936.0,7.9,1.780000,33000
1,Color,Gore Verbinski,302.0,169.0,563.000000,1000.0,Orlando Bloom,40000.0,3.094042e+08,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,3.000000e+08,2007.000000,5000.0,7.1,2.350000,0
2,Color,Sam Mendes,602.0,148.0,0.000000,161.0,Rory Kinnear,11000.0,2.000742e+08,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,2.450000e+08,2015.000000,393.0,6.8,2.350000,85000
3,Color,Christopher Nolan,813.0,164.0,22000.000000,23000.0,Christian Bale,27000.0,4.481306e+08,Action|Thriller,...,2701.0,English,USA,PG-13,2.500000e+08,2012.000000,23000.0,8.5,2.350000,164000
4,,Doug Walker,,,131.000000,,Rob Walker,131.0,4.846841e+07,Documentary,...,,,,R,3.975262e+07,2002.470517,12.0,7.1,2.220403,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5038,Color,Scott Smith,1.0,87.0,2.000000,318.0,Daphne Zuniga,637.0,4.846841e+07,Comedy|Drama,...,6.0,English,Canada,R,3.975262e+07,2013.000000,470.0,7.7,2.220403,84
5039,Color,Steven Spielberg,43.0,43.0,686.509212,319.0,Valorie Curry,841.0,4.846841e+07,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,3.975262e+07,2002.470517,593.0,7.5,16.000000,32000
5040,Color,Benjamin Roberds,13.0,76.0,0.000000,0.0,Maxwell Moody,0.0,4.846841e+07,Drama|Horror|Thriller,...,3.0,English,USA,R,1.400000e+03,2013.000000,0.0,6.3,2.220403,16
5041,Color,Daniel Hsia,14.0,100.0,0.000000,489.0,Daniel Henney,946.0,1.044300e+04,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,3.975262e+07,2012.000000,719.0,6.3,2.350000,660


In [234]:
data[columns_to_impute].isna().sum()

director_name              0
director_facebook_likes    0
gross                      0
plot_keywords              0
content_rating             0
budget                     0
title_year                 0
aspect_ratio               0
dtype: int64

In [235]:
data.isna().sum()

color                        19
director_name                 0
num_critic_for_reviews       50
duration                     15
director_facebook_likes       0
actor_3_facebook_likes       23
actor_2_name                 13
actor_1_facebook_likes        7
gross                         0
genres                        0
actor_1_name                  7
movie_title                   0
num_voted_users               0
cast_total_facebook_likes     0
actor_3_name                 23
facenumber_in_poster         13
plot_keywords                 0
movie_imdb_link               0
num_user_for_reviews         21
language                     12
country                       5
content_rating                0
budget                        0
title_year                    0
actor_2_facebook_likes       13
imdb_score                    0
aspect_ratio                  0
movie_facebook_likes          0
dtype: int64

In [236]:
data.dropna(inplace=True)

In [237]:
data

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.000000,855.0,Joel David Moore,1000.0,7.605058e+08,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,2.370000e+08,2009.000000,936.0,7.9,1.780000,33000
1,Color,Gore Verbinski,302.0,169.0,563.000000,1000.0,Orlando Bloom,40000.0,3.094042e+08,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,3.000000e+08,2007.000000,5000.0,7.1,2.350000,0
2,Color,Sam Mendes,602.0,148.0,0.000000,161.0,Rory Kinnear,11000.0,2.000742e+08,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,2.450000e+08,2015.000000,393.0,6.8,2.350000,85000
3,Color,Christopher Nolan,813.0,164.0,22000.000000,23000.0,Christian Bale,27000.0,4.481306e+08,Action|Thriller,...,2701.0,English,USA,PG-13,2.500000e+08,2012.000000,23000.0,8.5,2.350000,164000
5,Color,Andrew Stanton,462.0,132.0,475.000000,530.0,Samantha Morton,640.0,7.305868e+07,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,2.637000e+08,2012.000000,632.0,6.6,2.350000,24000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5038,Color,Scott Smith,1.0,87.0,2.000000,318.0,Daphne Zuniga,637.0,4.846841e+07,Comedy|Drama,...,6.0,English,Canada,R,3.975262e+07,2013.000000,470.0,7.7,2.220403,84
5039,Color,Steven Spielberg,43.0,43.0,686.509212,319.0,Valorie Curry,841.0,4.846841e+07,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,3.975262e+07,2002.470517,593.0,7.5,16.000000,32000
5040,Color,Benjamin Roberds,13.0,76.0,0.000000,0.0,Maxwell Moody,0.0,4.846841e+07,Drama|Horror|Thriller,...,3.0,English,USA,R,1.400000e+03,2013.000000,0.0,6.3,2.220403,16
5041,Color,Daniel Hsia,14.0,100.0,0.000000,489.0,Daniel Henney,946.0,1.044300e+04,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,3.975262e+07,2012.000000,719.0,6.3,2.350000,660


In [238]:
data.isna().sum()

color                        0
director_name                0
num_critic_for_reviews       0
duration                     0
director_facebook_likes      0
actor_3_facebook_likes       0
actor_2_name                 0
actor_1_facebook_likes       0
gross                        0
genres                       0
actor_1_name                 0
movie_title                  0
num_voted_users              0
cast_total_facebook_likes    0
actor_3_name                 0
facenumber_in_poster         0
plot_keywords                0
movie_imdb_link              0
num_user_for_reviews         0
language                     0
country                      0
content_rating               0
budget                       0
title_year                   0
actor_2_facebook_likes       0
imdb_score                   0
aspect_ratio                 0
movie_facebook_likes         0
dtype: int64

In [239]:
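# fit_transform refits the encoder on every call, so each column gets its own 0..n-1 codes.
label_encoder = LabelEncoder()
data['language'] = label_encoder.fit_transform(data['language'])
data['country'] = label_encoder.fit_transform(data['country'])
data['content_rating'] = label_encoder.fit_transform(data['content_rating'])
# 'imdb_score' is used here for the 'Rating' column named in the task; it is numeric,
# so its codes are simply the rank of each distinct score.
data['imdb_score'] = label_encoder.fit_transform(data['imdb_score'])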

In [240]:
data[['language', 'country', 'content_rating', 'imdb_score']]

Unnamed: 0,language,country,content_rating,imdb_score
0,11,59,7,62
1,11,59,7,54
2,11,58,7,51
3,11,59,7,68
5,11,59,7,49
...,...,...,...,...
5038,11,9,9,60
5039,11,59,10,58
5040,11,59,9,46
5041,11,59,7,46
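
The integer codes above are only interpretable if the code-to-label mapping is kept. One way to keep it, sketched here with `pandas.factorize` rather than `LabelEncoder`, is to store the returned labels alongside the codes. The `ratings` values are invented for illustration; with `sort=True` the labels are numbered in sorted order.

```python
import pandas as pd

# Hypothetical column; the values are invented for illustration.
ratings = pd.Series(["PG-13", "R", "PG-13", "G"])

# factorize(sort=True) assigns integer codes in sorted label order and
# also returns the labels, so the encoding can be reversed later.
codes, labels = pd.factorize(ratings, sort=True)

# Map the codes back to the original strings.
decoded = labels.take(codes).tolist()

print(codes.tolist())  # [1, 2, 1, 0]
print(decoded)         # ['PG-13', 'R', 'PG-13', 'G']
```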


In [241]:
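# 'country' already holds integer codes from the label-encoding step, so the new
# columns are named country_0, country_1, ...; dtype=int keeps the indicators as 0/1.
data = pd.get_dummies(data, columns=['country'], prefix=['country'], dtype=int)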

In [242]:
data.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'content_rating',
       'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
       'aspect_ratio', 'movie_facebook_likes', 'country_0', 'country_1',
       'country_2', 'country_3', 'country_4', 'country_5', 'country_6',
       'country_7', 'country_8', 'country_9', 'country_10', 'country_11',
       'country_12', 'country_13', 'country_14', 'country_15', 'country_16',
       'country_17', 'country_18', 'country_19', 'country_20', 'country_21',
       'country_22', 'country_23', 'country_24', 'country_25', 'country_26',
       'country_27', 'country_28', 'country_29', '

In [243]:
data.loc[:,'country_0':]

Unnamed: 0,country_0,country_1,country_2,country_3,country_4,country_5,country_6,country_7,country_8,country_9,...,country_51,country_52,country_53,country_54,country_55,country_56,country_57,country_58,country_59,country_60
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5038,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5039,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5040,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5041,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [244]:
data[['language', 'content_rating', 'imdb_score']]

Unnamed: 0,language,content_rating,imdb_score
0,11,7,62
1,11,7,54
2,11,7,51
3,11,7,68
5,11,7,49
...,...,...,...
5038,11,9,60
5039,11,10,58
5040,11,9,46
5041,11,7,46
