## Final Project
### To solve this set of questions, use the dataset at the relative path ./tmdb_5000_movies.csv


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Open Your CSV file and print the first 10 rows in a good format ###

In [2]:
data = pd.read_csv("tmdb_5000_movies.csv")
print(data.head(10))

      budget                                             genres  \
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  300000000  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  245000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  250000000  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  260000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
5  258000000  [{"id": 14, "name": "Fantasy"}, {"id": 28, "na...   
6  260000000  [{"id": 16, "name": "Animation"}, {"id": 10751...   
7  280000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
8  250000000  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
9  250000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                                            homepage      id  \
0                        http://www.avatarmovie.com/   19995   
1       http://disney.go.com/disneypictures/pirates/     285   
2        http://www.sonypictures.com/movies/spectre/  206647   
3     

##### Exploring data

### Check if there are any NaNs in your dataset and fill them with a good filler ###

In [3]:
data = data.replace(np.nan, "-") # replace every NaN with "-"; numeric columns that contained NaNs (e.g. runtime) become object columns
data.isna().sum() # check whether any null values are left

budget                  0
genres                  0
homepage                0
id                      0
keywords                0
original_language       0
original_title          0
overview                0
popularity              0
production_companies    0
production_countries    0
release_date            0
revenue                 0
runtime                 0
spoken_languages        0
status                  0
tagline                 0
title                   0
vote_average            0
vote_count              0
dtype: int64
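
Replacing every NaN with "-" also turns numeric columns that had gaps into object columns. As a minimal sketch of a type-aware alternative, assuming a small hypothetical frame `toy` in place of the real data, numeric columns could be filled with their median and text columns with a placeholder:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movie data (not real rows).
toy = pd.DataFrame({
    "runtime": [120.0, np.nan, 90.0],
    "tagline": ["Enter the world", np.nan, "One ring"],
})

# Fill numeric columns with their median so they stay numeric.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Fill the remaining (text) columns with a placeholder string.
text_cols = toy.columns.difference(numeric_cols)
toy[text_cols] = toy[text_cols].fillna("-")

print(toy.dtypes)
print(toy.isna().sum())
```

With this approach `runtime` would still be selected by `select_dtypes` in later steps; whether the median is an appropriate filler depends on the column.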

In [3]:
"""
There are many ways to fill missing values. The right method depends on
what each column represents and how its values are distributed.
"""
# The mean is not always the best choice: outliers pull it away from typical values.
# It is better suited to symmetrically distributed data.
"""dataset.fillna(dataset.mean(), inplace = True) """

# The median is a good choice for skewed data, but it only applies to numeric columns,
# so it cannot be used on the categorical columns here.
"""dataset.fillna(dataset.median(), inplace = True)"""

# Linear interpolation is another option: each NaN is replaced by a value on the straight
# line between its neighbouring known values, filling forward through the column.
"""dataset.interpolate(method ='linear', limit_direction ='forward')"""

"dataset.interpolate(method ='linear', limit_direction ='forward')"

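To make the differences between the three filling strategies above concrete, here is a small sketch on a hypothetical series `s` (not taken from the dataset) with one gap and one outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical series: one missing value and one outlier (100.0).
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 100.0])

mean_filled = s.fillna(s.mean())        # mean of the known values
median_filled = s.fillna(s.median())    # median of the known values
interp_filled = s.interpolate(method="linear", limit_direction="forward")

# Compare the value each strategy puts into the gap at position 2.
print(mean_filled[2], median_filled[2], interp_filled[2])
```

The mean fill is dragged toward the outlier, the median fill is not, and interpolation uses only the neighbouring values.
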
In [None]:
# Columns whose types need attention: release_date and runtime

### Discover the type of each column and modify them if needed ###

In [4]:
print(data["release_date"].head(10))
print(data.dtypes) # check the type of each column

data["release_date"] = pd.to_datetime(data["release_date"], format="%Y-%m-%d", errors='coerce')
print(data.head(2))
print(data.dtypes)

0    2009-12-10
1    2007-05-19
2    2015-10-26
3    2012-07-16
4    2012-03-07
5    2007-05-01
6    2010-11-24
7    2015-04-22
8    2009-07-07
9    2016-03-23
Name: release_date, dtype: object
budget                    int64
genres                   object
homepage                 object
id                        int64
keywords                 object
original_language        object
original_title           object
overview                 object
popularity              float64
production_companies     object
production_countries     object
release_date             object
revenue                   int64
runtime                  object
spoken_languages         object
status                   object
tagline                  object
title                    object
vote_average            float64
vote_count                int64
dtype: object
      budget                                             genres  \
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  300000000  [{"i
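
The dtype listing above shows `runtime` as `object`, a side effect of the "-" filler. A possible way to restore a numeric type is `pd.to_numeric` with `errors="coerce"`, sketched here on a hypothetical column (the values are illustrative only):

```python
import pandas as pd

# Hypothetical column after NaNs were replaced with "-".
toy = pd.DataFrame({"runtime": [162.0, "-", 148.0]})
print(toy["runtime"].dtype)

# Coerce non-numeric placeholders back to NaN so the column becomes numeric.
toy["runtime"] = pd.to_numeric(toy["runtime"], errors="coerce")
print(toy["runtime"].dtype)
```

The coerced entries become NaN again, so they would need a numeric filler (or to be dropped) before any statistics are computed.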

### Give a very simple statistical analysis for the numerical columns ###

In [5]:
numerical_cols = data.select_dtypes(include=["int64", "float64"]).columns
print(numerical_cols)

Index(['budget', 'id', 'popularity', 'revenue', 'vote_average', 'vote_count'], dtype='object')
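
The cell above only lists the numerical column names. A simple statistical summary could be obtained with `describe()`; the sketch below uses a hypothetical frame `toy` rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and one text column.
toy = pd.DataFrame({
    "budget": [100, 200, 300],
    "vote_average": [6.0, 7.0, 8.0],
    "title": ["A", "B", "C"],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = toy.select_dtypes(include="number").describe()
print(summary)
```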


### Calculate the mean rating of the movies released in [1999, 1980, 2004, 2017], grouped by year ###

In [10]:
target_years = [1999, 1980, 2004, 2017]
data["year"] = data["release_date"].dt.year
filtered_data = data[data["year"].isin(target_years)]
mean_ratings = filtered_data.groupby("year")["vote_average"].mean()
print(mean_ratings)

year
1980.0    6.609091
1999.0    6.110526
2004.0    6.104412
2017.0    7.400000
Name: vote_average, dtype: float64


### Rearrange the dataframe based on revenue - budget values ###

In [13]:
data["profit"] = data["revenue"] - data["budget"]
data_sorted = data.sort_values(by="profit", ascending=False)
print(data_sorted[["title", "budget", "revenue", "profit"]].head(10))

                                             title     budget     revenue  \
0                                           Avatar  237000000  2787965087   
25                                         Titanic  200000000  1845034188   
28                                  Jurassic World  150000000  1513528810   
44                                       Furious 7  190000000  1506249360   
16                                    The Avengers  220000000  1519557910   
7                          Avengers: Age of Ultron  280000000  1405403694   
124                                         Frozen  150000000  1274219009   
546                                        Minions   74000000  1156730962   
329  The Lord of the Rings: The Return of the King   94000000  1118888979   
31                                      Iron Man 3  200000000  1215439994   

         profit  
0    2550965087  
25   1645034188  
28   1363528810  
44   1316249360  
16   1299557910  
7    1125403694  
124  1124219009  
546  108

### Locate the year with the largest number of movies released in ['Action', 'Romance'] ###

In [14]:
filtered = data[data["genres"].str.contains("Action|Romance", case=False, na=False)]
movies_per_year = filtered["year"].value_counts().sort_values(ascending=False)
most_active_year = movies_per_year.idxmax()
count = movies_per_year.max()

print(f"The year with the most 'Action' or 'Romance' movies is {most_active_year} with {count} movies.")

The year with the most 'Action' or 'Romance' movies is 2009.0 with 108 movies.


### Find the 5 movies with the highest revenue - budget value ###

In [15]:
print(data_sorted[["title", "budget", "revenue", "profit"]].head(5))

             title     budget     revenue      profit
0           Avatar  237000000  2787965087  2550965087
25         Titanic  200000000  1845034188  1645034188
28  Jurassic World  150000000  1513528810  1363528810
44       Furious 7  190000000  1506249360  1316249360
16    The Avengers  220000000  1519557910  1299557910


### Find the year with the highest number of movies released ###

### Find the top 2 countries with the highest number of produced movies ###

### Find the top 1 company with the highest number of produced movies ###
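
Columns such as `genres` store a JSON-encoded list of dictionaries in each cell. One possible approach for the country and company questions is to parse each cell with `json.loads`, keep the `name` fields, and count them. The sketch below assumes `production_countries` follows the same layout as `genres` with a `name` key (the `iso_3166_1` key and all rows are hypothetical):

```python
import json
import pandas as pd

us = '{"iso_3166_1": "US", "name": "United States of America"}'
gb = '{"iso_3166_1": "GB", "name": "United Kingdom"}'
fr = '{"iso_3166_1": "FR", "name": "France"}'

# Hypothetical rows mimicking the JSON-encoded list columns.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "production_countries": [
        f"[{us}]",
        f"[{us}, {gb}]",
        f"[{gb}, {fr}]",
        f"[{us}]",
    ],
})

def names(cell):
    """Return the 'name' of every entry in a JSON-encoded list."""
    return [item["name"] for item in json.loads(cell)]

# One row per (movie, country), then count movies per country.
countries = toy["production_countries"].apply(names).explode()
top_countries = countries.value_counts().head(2)
print(top_countries)
```

The same helper could be applied to `production_companies` with `head(1)`, provided that column uses the same structure.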



### Is there any relation between the runtime and average vote value? ###
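
One simple way to quantify such a relation is the Pearson correlation coefficient. The sketch below uses hypothetical values and assumes `runtime` is numeric; in this notebook it is an object column after the "-" fill, so it would first need converting (for example with `pd.to_numeric`):

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the call.
toy = pd.DataFrame({
    "runtime": [90, 100, 120, 150, 180],
    "vote_average": [5.5, 6.0, 6.4, 7.1, 7.8],
})

# Series.corr computes the Pearson correlation by default.
r = toy["runtime"].corr(toy["vote_average"])
print(f"Pearson r = {r:.2f}")
```

A value near 1 or -1 suggests a strong linear relation and a value near 0 a weak one; a scatter plot is a useful complement because correlation only captures linear trends.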


### Find the top 5 movies with the highest rating, and find if there is anything common between them. ###



### Find the most unsuccessful movie of all time in terms of revenue - budget ###



### Rearrange the dataframe based on vote_average column values ###



### Rearrange the dataframe based on runtime column values ###



### Find the top 5 successful years for the USA cinema based on the total income divided by number of movies ###
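
A possible reading of this question, taking "income" to mean the `revenue` column (an assumption), is: keep the movies produced in the USA, group them by year, and divide total revenue by the number of movies. The rows below are hypothetical, and the country strings assume the same JSON layout as `genres`:

```python
import pandas as pd

us = '[{"iso_3166_1": "US", "name": "United States of America"}]'
gb = '[{"iso_3166_1": "GB", "name": "United Kingdom"}]'

# Hypothetical rows; "year" plays the role of the column derived from release_date.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002],
    "revenue": [100, 300, 500, 100, 50],
    "production_countries": [us, us, us, gb, us],
})

# Keep only movies that list the USA among their production countries.
is_usa = toy["production_countries"].str.contains("United States of America", regex=False)
usa = toy[is_usa]

# Total revenue and movie count per year, then revenue per movie.
per_year = usa.groupby("year")["revenue"].agg(total="sum", movies="count")
per_year["income_per_movie"] = per_year["total"] / per_year["movies"]
top_years = per_year.sort_values("income_per_movie", ascending=False).head(5)
print(top_years)
```

Total divided by count is the per-year mean, so `usa.groupby("year")["revenue"].mean()` would give the same ranking.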



### Find the most successful movie in [USA, UK] ###



### In your opinion, which variable affects the revenue value the most (highest correlation)? ### BONUS
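
One way to approach this is to rank the numeric columns by the absolute value of their Pearson correlation with `revenue`. The sketch below runs on hypothetical numbers, so its ranking says nothing about the real dataset:

```python
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "vote_count": [5, 1, 4, 2, 3],
    "revenue": [12, 25, 31, 47, 52],
    "title": list("ABCDE"),
})

# Correlation of every numeric column with revenue, excluding revenue itself.
corr = toy.corr(numeric_only=True)["revenue"].drop("revenue")

# Rank by strength of the linear relation, ignoring its sign.
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

Correlation measures linear association only and does not establish that a variable causes higher revenue.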



## Good Luck