Exploratory data analysis (EDA) is a crucial component of data science: it lets you understand the basics of what your data looks like and what kinds of questions it might answer. For this task, we are going to clean, sanitise and explore our data. Using the movies dataset, answer the following questions by writing code in the cells.


In [None]:
# Importing the required packages here

import numpy as np
import pandas as pd
import seaborn as sns
import ast, json

from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline


In [None]:
#### Load the movie dataset and create its dataframe

movies_df = 


### Data Cleansing 
#### Clean the data. Identify columns that are redundant or unnecessary.

It is always easier to make your decisions based on data which is relevant and concise. Remove the following columns ['keywords', 'homepage', 'status', 'tagline', 'original_language', 'overview', 'production_companies', 'original_title'] from the data set as they will not be used in the analysis.

In [None]:
# code here
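
One possible approach, sketched on a small made-up frame rather than the real dataset (`toy_df` and its values are assumptions for illustration): `DataFrame.drop` with `errors="ignore"` removes the listed columns and skips any that are not present.

```python
import pandas as pd

# Toy stand-in for movies_df; the real dataset has many more columns.
toy_df = pd.DataFrame({
    "title": ["A", "B"],
    "budget": [100, 200],
    "homepage": ["http://a.example", None],
    "tagline": ["x", "y"],
})

unused_columns = ["keywords", "homepage", "status", "tagline", "original_language",
                  "overview", "production_companies", "original_title"]

# errors="ignore" skips names that are missing instead of raising a KeyError.
toy_df = toy_df.drop(columns=unused_columns, errors="ignore")
print(toy_df.columns.tolist())
```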

#### Remove any duplicate rows

In [None]:
# code here
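
A minimal sketch of de-duplication on a made-up frame (`toy_df` is an assumption, not the movies data). Resetting the index afterwards is optional, but it keeps row labels contiguous for later steps.

```python
import pandas as pd

# Toy frame with one exact duplicate row.
toy_df = pd.DataFrame({"title": ["A", "A", "B"], "budget": [100, 100, 200]})

# By default drop_duplicates compares all columns and keeps the first copy.
deduped = toy_df.drop_duplicates().reset_index(drop=True)
print(len(toy_df), "rows before,", len(deduped), "rows after")
```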

#### Some movies in the database have zero budget or zero revenue, which implies that their values were not recorded or that some information is missing. Discard such entries from the dataframe.

In [None]:
# Code here
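
One way this filter could look, shown on a made-up frame (`toy_df` and its numbers are illustrative assumptions): build a boolean mask that is true only when both values are positive, then index with it.

```python
import pandas as pd

# Toy frame: B has no budget recorded, C has no revenue recorded.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100, 0, 300],
    "revenue": [500, 700, 0],
})

# Keep only rows where both values were actually recorded.
has_financials = (toy_df["budget"] > 0) & (toy_df["revenue"] > 0)
filtered = toy_df[has_financials].reset_index(drop=True)
print(filtered)
```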

#### To manipulate the columns easily, it is important that we work with proper Python objects. Convert the release_date column to a datetime format and extract the year from each date. This will help us analyse yearly data.

In [None]:
# Change the release_date column to DateTime column


# Extract the release year from every release date


#### Convert the budget and revenue columns to integers using numpy’s int64 type.

In [None]:
# code here
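
A small sketch of the conversion, assuming the columns arrive as floats (the values in `toy_df` are illustrative). A revenue such as 2787965087 does not fit in a 32-bit integer, which is one reason to ask for `int64` explicitly.

```python
import numpy as np
import pandas as pd

# Toy frame with float-typed money columns.
toy_df = pd.DataFrame({
    "budget": [237000000.0, 300000000.0],
    "revenue": [2787965087.0, 961000000.0],
})

# astype accepts a mapping from column name to target dtype.
toy_df = toy_df.astype({"budget": np.int64, "revenue": np.int64})
print(toy_df.dtypes)
```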

On checking the dataset, we see that the genres, keywords, production_companies, production_countries and spoken_languages columns are stored as JSON strings, which makes them difficult to manipulate in a dataframe. The keywords and production_companies columns were dropped during cleaning, so let’s flatten the remaining three columns into a format that can be easily interpreted.
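
To make the JSON format concrete, here is a sketch of how a single cell might be decoded. The string in `raw_genres` is a made-up example in the style of the genres column, not a value copied from the dataset.

```python
import json

# Example cell value in the style of the "genres" column (made up for illustration).
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# json.loads turns the string into a list of dicts; we keep only the "name" values.
genre_names = [entry["name"] for entry in json.loads(raw_genres)]
print(genre_names)
```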

In [None]:
def parse_col_json(column, key):
    """
    Args:
        column: string
            name of the column to be processed.
        key: string
            name of the dictionary key which needs to be extracted

    Results:
        movies_df will have column dropped and replaced with a new column for each unique value
        For example, if the "genres" column had an "Action" in it, there will be a new column
        called "genres_Action". Every movie that had the "Action" genre will have a 1 in that column,
        and 0 otherwise.
    """
    global movies_df # ensure that we can directly manipulate movies_df
    new_columns = {} # Maps each new column name to its array of 0/1 flags
    for pos, items in enumerate(movies_df[column].apply(json.loads)):
        # Use the row position rather than the index label: once rows have been
        # dropped during cleaning, the index labels are no longer contiguous
        for item in items:
            # For each item in the current JSON list, take the value stored under 'key'
            name = f"{column}_{item[key]}"
            if name not in new_columns:
                # If this item doesn't have a corresponding column, create one
                new_columns[name] = np.zeros(movies_df.shape[0], dtype=int)
            new_columns[name][pos] = 1
    # Concatenate new columns to movies_df, aligned on the existing index
    movies_df = pd.concat([movies_df, pd.DataFrame(new_columns, index=movies_df.index)], axis=1).drop(column, axis=1)
            
    
parse_col_json('genres', 'name')
parse_col_json('spoken_languages', 'name')
parse_col_json('production_countries', 'name')


movies_df.columns

### Finding Certain Genres
Let's say that we want to locate all movies in the "Action" genre. With this new format, it becomes a simple matter.


In [None]:
action_movies = movies_df[movies_df.genres_Action == 1]
action_movies.head()

Unnamed: 0,budget,homepage,id,original_language,original_title,overview,popularity,release_date,revenue,runtime,...,keywords_personality disorder,keywords_serial kiler,keywords_latino lgbt,keywords_gang initiation,keywords_gunplay,keywords_homeless,keywords_arms,keywords_paper knife,keywords_guitar case,keywords_postal worker
0,237000000,http://www.avatarmovie.com/,19995,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,2009-12-10,2787965087,162.0,...,0,0,0,0,0,0,0,0,0,0
1,300000000,http://disney.go.com/disneypictures/pirates/,285,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,2007-05-19,961000000,169.0,...,0,0,0,0,0,0,0,0,0,0
2,245000000,http://www.sonypictures.com/movies/spectre/,206647,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,2015-10-26,880674609,148.0,...,0,0,0,0,0,0,0,0,0,0
3,250000000,http://www.thedarkknightrises.com/,49026,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,2012-07-16,1084939099,165.0,...,0,0,0,0,0,0,0,0,0,0
4,260000000,http://movies.disney.com/john-carter,49529,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,2012-03-07,284139100,132.0,...,0,0,0,0,0,0,0,0,0,0


### Now onto the exploration

#### Identify relationships between variables / features

The main goal here is to identify and create relationships which can help you to build ideas. The questions below can help you identify some relationships to explore.

#### Which are the 5 most expensive movies? How do the most expensive and cheapest movies compare? Exploring the most expensive movies helps you judge whether they were worth the money spent on them, based on their performance and the revenue they generated.

In [None]:
# Code here
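
One possible approach on a made-up frame (`toy_df` and its numbers are assumptions): `nlargest` returns the top rows by a column, and `idxmax` / `idxmin` locate the two extremes so they can be placed side by side.

```python
import pandas as pd

# Toy frame; budgets and revenues are invented for illustration.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "budget": [50, 300, 10, 120, 260, 80],
    "revenue": [90, 960, 40, 100, 880, 200],
})

# The five rows with the largest budgets, in descending order.
top5 = toy_df.nlargest(5, "budget")

# Put the most expensive and the cheapest movie side by side for comparison.
most_expensive = toy_df.loc[toy_df["budget"].idxmax()]
cheapest = toy_df.loc[toy_df["budget"].idxmin()]
comparison = pd.DataFrame({"most_expensive": most_expensive, "cheapest": cheapest})
print(top5)
print(comparison)
```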



#### What are the top 5 most profitable movies? Compare the min and max profits. The comparison helps us identify the different approaches which failed and succeeded. Subtracting the budget from the revenue gives the profit earned.

In [None]:
# code here
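
A minimal sketch of the profit calculation, using a made-up frame. The `profit` column name is an assumption introduced here for illustration.

```python
import pandas as pd

# Toy frame; B loses money, A is the clear winner.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100, 200, 50],
    "revenue": [500, 150, 60],
})

# Profit is revenue minus budget, computed row by row.
toy_df["profit"] = toy_df["revenue"] - toy_df["budget"]

# nlargest returns fewer rows when the frame has fewer than five.
top_profit = toy_df.nlargest(5, "profit")
max_profit = toy_df["profit"].max()
min_profit = toy_df["profit"].min()
print(top_profit)
print("max:", max_profit, "min:", min_profit)
```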




#### Find the most talked about movies. Sort the dataframe based on the popularity column.

#### Find Movies which are rated above 7



In [None]:
# Code here
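
A sketch covering both questions on a made-up frame. The `popularity` column appears in the dataset output above; the rating column name `vote_average` is an assumption here and should be replaced with whatever the rating column is actually called.

```python
import pandas as pd

# Toy frame; "vote_average" is an assumed name for the rating column.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "popularity": [150.4, 43.9, 107.4],
    "vote_average": [7.2, 6.1, 7.8],
})

# Most talked about first: sort by popularity in descending order.
by_popularity = toy_df.sort_values("popularity", ascending=False)

# Strictly above 7, as the question asks.
highly_rated = toy_df[toy_df["vote_average"] > 7]
print(by_popularity)
print(highly_rated)
```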




#### Most successful genres: create a bar plot showing the frequency of movies in each genre.

In [None]:
 # Code here
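
One way the genre frequencies could be counted and plotted, assuming the one-hot `genres_*` columns produced by `parse_col_json`. The frame below is made up, and the axis labels are illustrative choices.

```python
import pandas as pd

# Toy frame with one-hot genre columns in the style created by parse_col_json.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres_Action": [1, 1, 1, 0],
    "genres_Drama": [0, 1, 1, 0],
    "genres_Comedy": [0, 0, 0, 1],
})

# Summing a 0/1 column gives the number of movies tagged with that genre.
genre_columns = [c for c in toy_df.columns if c.startswith("genres_")]
genre_counts = toy_df[genre_columns].sum().sort_values(ascending=False)

# Strip the prefix so the axis shows plain genre names.
genre_counts.index = genre_counts.index.str.replace("genres_", "", regex=False)

# pandas plotting uses matplotlib under the hood and returns the Axes object.
ax = genre_counts.plot(kind="bar")
ax.set_xlabel("Genre")
ax.set_ylabel("Number of movies")
```

Note that frequency only shows how common a genre is; pairing it with revenue or profit per genre would say more about success.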

In [None]:
#### Generate three different interesting visualisations with a data story.









# Now that you know how to explore a dataset, it's time for you to do it from start to end. Please find the Automobile dataset in your task folder.

### You are expected to create a report ('eda.docx' provides a template for what this report should look like) in which you explain your visualisations, investigations and findings. The code for the analysis should be in a Jupyter notebook named automobile.ipynb.

## Be creative :)