

# Project: Investigate a Dataset - [TMDb Dataset]

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

> In this section of the report, we give a brief introduction to the dataset selected for the analysis.
The file analyzed is tmdb-movies.csv, provided by Kaggle. It contains information about roughly 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. In this report, we investigate patterns and insights in this data.



### Question(s) for Analysis
>To understand the dataset and look for hidden insights, we pose the following questions:
>1) Which genres are most popular over the years?
>2) Which years have the most movie releases?
>3) What are the top 5 production companies?


In [None]:
# In this cell, we import all the packages needed for the analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as snb
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling




### General Properties
> In this section, we explore the data at hand.

In [None]:
# In this cell, we load the data and print the first few rows.
df = pd.read_csv('Database_TMDb_movie_data/tmdb-movies.csv')
df.head()

In [None]:
# Exploring the data properties (shape)
df.shape

> As seen, the data consists of more than 10K movies and 21 attributes (columns) with various information about each movie.

In [None]:
df.info()

>To get a clear view of the dataset, we show its structure by listing the attributes, their data types, and their non-null counts.

In [None]:
df.describe()

> We describe the basic features of the data to better understand its behavior: what it looks like and how spread out it is. The summary above therefore reports statistics such as the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

>as we see, 


### Data Preparation &  Cleaning
> The dataset was not ready for analysis, so we check it for shortcomings and then prepare it by cleaning it, handling the missing values, removing irrelevant attributes, and splitting others. Based on our understanding of the data, we also deleted some rows that contained wrong or useless values, as discussed later.



# 1) Removing duplicates

In [None]:
# Checking whether there are any duplicated rows
df.duplicated().sum()

> There is one duplicated row.

In [None]:
# To deal with this duplication, we drop the duplicated row and keep its first occurrence.
df = df.drop_duplicates(keep='first')
# Checking the shape again to ensure the duplicated row was dropped
df.shape

# 2) Handling missing values

In [None]:
# Checking whether there are any missing values
df.isnull().sum()

>As we can see above, there are about 10,609 missing values distributed across 9 columns.
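
The two totals quoted here (number of missing cells and number of affected columns) can be derived from the per-column null counts. The following is a minimal sketch on a small hypothetical frame named `toy`; its values are made up for illustration and are not the TMDb data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movie data (not the real TMDb values).
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "genres": ["Drama|Comedy", np.nan, "Action", np.nan],
    "homepage": [np.nan, np.nan, "http://example.com", np.nan],
    "budget": [100, 0, 250, 75],
})

missing_per_column = toy.isnull().sum()                      # null count for each column
total_missing = int(missing_per_column.sum())                # null cells in the whole frame
columns_with_missing = int((missing_per_column > 0).sum())   # columns with at least one null

print(missing_per_column[missing_per_column > 0])
print(f"{total_missing} missing values in {columns_with_missing} columns")
```

Running the same three expressions on `df` would give the corresponding figures for the real dataset.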

In [None]:

# To deal with the missing values, we replace them with zero using the fillna function.
# fillna returns a new DataFrame: the filled copy is stored in df1, and df itself is left unchanged.
df1 = df.fillna(0)
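
Because `fillna` returns a new object rather than modifying the frame it is called on, it matters which variable later cells use. The sketch below illustrates this on a hypothetical two-column frame (`toy`, with made-up values). It also shows one possible alternative, which is an assumption and not the approach taken in this notebook: filling only the text column with a placeholder string so that it stays purely textual.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real TMDb values).
toy = pd.DataFrame({
    "genres": pd.Series(["Drama", None, "Action"], dtype=object),
    "runtime": [120.0, np.nan, 95.0],
})

# fillna returns a new DataFrame; the original keeps its missing values.
filled = toy.fillna(0)
print(toy.isnull().sum().sum())     # nulls still present in toy
print(filled.isnull().sum().sum())  # nulls in the filled copy

# Alternative (an assumption, not the notebook's approach): fill only the
# text column with a placeholder string and leave numeric columns untouched.
text_cols = ["genres"]
by_type = toy.copy()
by_type[text_cols] = by_type[text_cols].fillna("Unknown")
print(by_type)
```

With the zero fill, a text column ends up holding a mix of strings and the integer 0, which is worth keeping in mind before applying string methods to it.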

# 3) Attribute omission
We decided to remove some attributes from our analysis because they are irrelevant to the questions posed.

In [None]:
# The columns imdb_id, homepage, tagline, overview, budget_adj and revenue_adj are irrelevant to our questions, so we drop them.
df.drop(['imdb_id', 'homepage', 'tagline', 'overview', 'budget_adj', 'revenue_adj'], axis=1, inplace=True)
df.head()

> So far, we have completed several steps. We first loaded the dataset from a CSV file and took a quick look at the data. We then explored its general properties, showed its structure and dimensions, and computed basic statistics to understand it better.
After this initial exploration, we applied several preprocessing steps: 1) checking for shortcomings such as duplicates and missing values and dealing with them, and 2) attribute omission, where we removed the columns that are irrelevant to our analysis.

<a id='eda'></a>
## Exploratory Data Analysis

> Now that preprocessing is done, our dataset is ready for exploration. In this stage we look for dependencies between variables and examine the relationships between the data attributes using summary statistics, and we support the results with visualizations. The research questions stated in the introduction are addressed in turn below.







### Which genres are most popular over the years?

In [None]:
# Each movie can have multiple genres separated by '|', so the following function splits the strings and counts each genre.
def data(splt_gnr):
    # Join the values of every row into one '|'-separated string (missing values are skipped).
    data_plot = df[splt_gnr].str.cat(sep='|')
    # Split the joined string back into individual values.
    data = pd.Series(data_plot.split('|'))
    # Count each value, most frequent first.
    info = data.value_counts(ascending=False)
    return info
    

In [None]:
sum_genre_movies = data('genres')
print(sum_genre_movies)

In [None]:
# Now we visualize the distribution of movie genres as a bar chart.
sum_genre_movies.plot(kind='bar', figsize=(14, 6), fontsize=15)
# Set the plot's title and axis labels: genres are on the x-axis, movie counts on the y-axis.
plt.title("Most Popular Genres of Released Movies", fontsize=16)
plt.xlabel('Genres', fontsize=15)
plt.ylabel('Total Movies', fontsize=15)

>As shown above, the most popular genres by number of movies are Drama, Comedy, Thriller, and Action. This may suggest that the largest audience for these movies is young people, although the dataset contains no audience demographics to confirm it.
>Western is the least popular genre.
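
The counts above cover all years together. To look at genres year by year, one option is to split the genre strings into one row per movie and genre and then count per release year. This is a minimal sketch on hypothetical rows (`toy`, with made-up values); the column names `release_year` and `genres` match the dataset, while `genre`, `long_form`, `genre_by_year`, and `top_genre_per_year` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy rows (not the real TMDb values).
toy = pd.DataFrame({
    "release_year": [2014, 2014, 2015, 2015],
    "genres": ["Drama|Comedy", "Drama", "Action|Thriller", "Drama|Action"],
})

# One row per (movie, genre): split on '|' and explode the resulting lists.
long_form = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")

# Number of movies per genre within each release year.
genre_by_year = (
    long_form.groupby(["release_year", "genre"]).size().unstack(fill_value=0)
)
print(genre_by_year)

# Most frequent genre in each year.
top_genre_per_year = genre_by_year.idxmax(axis=1)
print(top_genre_per_year)
```

The resulting table has one row per year and one column per genre, so it can be plotted directly to compare genre trends over time.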

### Which years have the most movie releases?

In [None]:

# Count the number of movies released in each year, ordered by year.
count_release = df['release_year'].value_counts().sort_index()
count_release

In [None]:
# Create a bar chart of the total number of releases in each year (the Series index supplies the years).
count_release.plot(kind='bar', figsize=(14, 6), fontsize=15)
# Set the plot title and axis labels.
plt.title('Number of Movie Releases by Year', fontsize=15)
plt.xlabel('Year', fontsize=15)
plt.ylabel('Number of Releases', fontsize=15)

In [None]:
# We can also show the same counts as a line chart.
# Set the seaborn figure size and style first so that they apply to this figure.
snb.set(rc={'figure.figsize': (14, 6)})
snb.set_style("whitegrid")
count_release.plot(kind='line')
plt.title("Number of Movies per Year", fontsize=16)
plt.xlabel('Year of Release', fontsize=15)
plt.ylabel('Number of Movies', fontsize=15)

As we can see, the overall trend in releases increases over the years, which may be explained by the growth of the film-making industry and increasing demand. 2014 has the largest number of movie releases.
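
The busiest year can also be read from the counts directly instead of from the chart. The sketch below uses a short hypothetical list of release years (made-up values, not the TMDb data); `years`, `releases_per_year`, `busiest_year`, and `top_years` are illustrative names.

```python
import pandas as pd

# Hypothetical release years (not the real TMDb values).
years = pd.Series([2013, 2014, 2014, 2015, 2014, 2015], name="release_year")

releases_per_year = years.value_counts().sort_index()
busiest_year = releases_per_year.idxmax()   # year label with the highest count
top_years = releases_per_year.nlargest(2)   # the two busiest years with their counts

print(busiest_year)
print(top_years)
```

Applied to the notebook's `count_release`, the same two calls would return the year with the most releases and a ranking of the busiest years.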

### What are the top 5 production companies?

In [None]:
# The companies are separated with "|", so first we split them into one indicator column per company.
pc = df['production_companies'].str.get_dummies(sep='|')

In [None]:
# Sum each indicator column to get the number of movies per company.
ppc = pc.sum()

In [None]:
# Pie chart of the five companies with the most movies
ppc.sort_values(ascending=False).head(5).plot.pie(autopct='%1.1f%%', shadow=True, frame=True)
plt.show()


##### As we can see from the pie chart, the top 5 production companies are as follows (percentages as reported from the chart):
1. Universal Pictures, with 25.9%.
2. Warner Bros., with 16.9%.
3. Paramount Pictures, with 14.3%.
4. Twentieth Century Fox Film Corporation, with 9.4%.
5. Columbia Pictures, with 9.1%.
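
When reading percentages off a pie chart, note that matplotlib's `autopct` labels express each slice as a share of the plotted slices only, not of all movies in the dataset. The sketch below contrasts the two calculations on hypothetical company strings (made-up studio names and values); `share_of_plotted` and `share_of_all_movies` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical company strings (not the real TMDb values).
companies = pd.Series([
    "Studio A|Studio B",
    "Studio A",
    "Studio A|Studio C",
    "Studio C",
    "Studio D",
    "Studio E",
])

# One indicator column per company, then the number of movies for each.
movie_counts = companies.str.get_dummies(sep="|").sum()
top = movie_counts.sort_values(ascending=False).head(2)

# What pie chart labels show: each slice divided by the sum of the plotted slices.
share_of_plotted = top / top.sum() * 100
# Share of all movies: each count divided by the number of rows.
share_of_all_movies = top / len(companies) * 100

print(share_of_plotted.round(1))
print(share_of_all_movies.round(1))
```

The two results differ whenever the plotted companies do not account for every movie, so it helps to state which denominator a reported percentage uses.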


<a id='conclusions'></a>
## Conclusions

> In conclusion, this project is a practical exercise in the data investigation process and in the basics of data discovery that every data analyst should consider.
Regarding our dataset (The Movie Database), we carried out the following steps during the investigation:
Data description: we posed questions to help us investigate the data.
Data wrangling: we explored issues in the data that affect its readiness for analysis and addressed them by 1) removing one duplicated row, 2) handling more than 10,600 missing values, and 3) attribute omission, where we removed irrelevant columns such as 'imdb_id', 'homepage', 'tagline', 'overview', 'budget_adj' and 'revenue_adj'.



>For the Exploratory Data Analysis phase, we can summarize our findings for the stated questions as follows:
The genres with the most movies, and so apparently the most preferred by the audience, are drama, comedy, and thriller.
With regard to the number of films released, the year with the most releases is 2014, and 2015 is also among the highest.
We also observed that the number of movie releases has clearly increased over the years, which is consistent with the development of the film industry and the demand for it.
Finally, the top five production companies are shown in the pie chart above.

### Limitations
>The main limitation we faced came up during data cleaning: the null values. We could not drop the rows that contain null values, because doing so would affect the overall analysis, so we replaced the nulls with zero.



In [32]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

0