

# Project: Investigate a Dataset - [TMDb Movies]

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters.
There are some odd characters in the ‘cast’ column. 
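
A minimal sketch of how such pipe-separated columns can be split into one row per value. The titles and genres below are made up for illustration and are not taken from the dataset. It assumes pandas 0.25 or later, where `explode()` is available.

```python
import pandas as pd

# Hypothetical sample rows mimicking the pipe-separated 'genres' column
sample = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure", "Drama"],
})

# Split each string on '|' into a list, then give every genre its own row
genres_long = (
    sample.assign(genres=sample["genres"].str.split("|"))
          .explode("genres")
          .reset_index(drop=True)
)
print(genres_long)
```

Each movie is repeated once per genre, which makes it possible to group or count by individual genre.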

### Question(s) for Analysis
> **Q1**: Which 10 movies have the highest revenue?

> **Q2**: Which 5 movies are the most popular?

> **Q3**: How have film budgets changed over time, from 1960 to 2015?

In [1]:
# Import the packages used in this analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot  # aliased as `plot` throughout this notebook
import seaborn as sns



In [2]:
# Upgrade pandas to use dataframe.explode() function. 
#pip install --upgrade pandas==0.25.0

<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, I load the data, check it for cleanliness, and then trim and clean the dataset for analysis. Each cleaning step is documented next to the code that performs it.


### General Properties
> **Tip**: In this part I explore the general properties of the data.

In [3]:
# Load the data
df=pd.read_csv('../input/tmdb-movies-dataset/tmdb_movies_data.csv')
df.head(3)

In [4]:
# List the column names to see what the dataset contains
df.columns

In [5]:
# Show summary descriptive statistics for the numeric columns
df.describe()

In [6]:
# Show the number of samples, the column types, and whether any column has missing values
df.info()

In [7]:
# Shape of the dataset: 10866 rows and 21 columns
df.shape
 


## Data Cleaning
> **Tip**: In this part I fix or remove incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data.
 

In [8]:
# Remove the columns that are not needed for this analysis
df.drop(['id', 'imdb_id',  'homepage', 'tagline',  'overview', 'production_companies', 'budget_adj', 'revenue_adj','keywords'], axis=1, inplace=True)


In [9]:
# Check the result after dropping the columns
df.head(2)

In [10]:
# Show the last five rows of the dataset
df.tail()

In [11]:
# tail() revealed rows where budget or revenue is zero (i.e. unknown), so those rows are dropped
df.drop(df[df['budget'] == 0].index, inplace=True)
df.drop(df[df['revenue'] == 0].index, inplace=True)
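
The two `drop` calls above can also be expressed as a single boolean mask, which makes it easy to report how many rows are removed. A sketch on made-up numbers (the values are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample: a zero in budget or revenue stands for "unknown"
sample = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget":  [100, 0, 50, 0],
    "revenue": [300, 20, 0, 0],
})

# Keep only rows where both budget and revenue are non-zero
mask = (sample["budget"] != 0) & (sample["revenue"] != 0)
cleaned = sample[mask]

print(f"Dropped {len(sample) - len(cleaned)} of {len(sample)} rows")
print(cleaned)
```

Reporting the number of dropped rows helps judge how much of the dataset the remaining analysis actually covers.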

In [12]:
# Rename the column 'runtime' to 'Time'
df.rename(columns={'runtime':'Time'},inplace=True)
df.head(3)

> **Tip**: In this part I check whether there are missing values.

In [13]:
# Helper function that counts the missing values in each column
def check_Missing_values():
    return df.isnull().sum()
check_Missing_values()

In [14]:
# Drop the rows with missing values in 'cast' or 'director'
df.dropna(subset=['cast','director'],axis=0,inplace=True)

In [15]:
# Call the function again to check how many missing values remain
check_Missing_values()


> **Tip**: In this part I check whether there are duplicate rows.

In [16]:
df.duplicated().any()

In [17]:
# Drop the duplicate rows, then confirm that none remain
df.drop_duplicates(inplace=True)
df.duplicated().any()
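
A small sketch of what `duplicated()` and `drop_duplicates()` do, using made-up rows rather than the real dataset:

```python
import pandas as pd

# Hypothetical sample in which the third row repeats the first one exactly
sample = pd.DataFrame({
    "original_title": ["A", "B", "A"],
    "release_year": [2000, 2001, 2000],
})

# duplicated() flags every repeat of an earlier, fully identical row
print(sample.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default
deduped = sample.drop_duplicates()
print(deduped.duplicated().any())
```

Only rows that match in every column count as duplicates here; two different movies with the same title would both be kept.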

> **Tip**: The data after cleaning

In [18]:
df.head(3)

<a id='eda'></a>
## Exploratory Data Analysis

> **1-**: In this part I explore the data and create some visualizations.






#### A quick look at the dataset

In [19]:
df.hist(figsize=(9,9))

#### Explore the data with pivot_table

In [20]:
df.pivot_table(index='original_title',values='revenue',columns='release_year',aggfunc='mean').head(10)

### Explore correlations between columns

In [21]:
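# Correlate the numeric columns only; text columns such as 'cast' cannot be correlated
df.select_dtypes(include='number').corr()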

In [22]:
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='YlGnBu', annot=True, linewidths = 0.2)

### Question 1: Which 10 movies have the highest revenue?

In [23]:
# Sort by revenue (highest first) and keep the 10 top-grossing movies
info = df.sort_values('revenue', ascending=False).head(10)
x = list(info['original_title'].astype(str))
y = list(info['revenue'])

# Set the figure size and style before plotting so that they apply to this figure
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style("darkgrid")

# Make a point plot of the top 10 movies by revenue
ax = sns.pointplot(x=y,y=x)

# Set the title and axis label of the plot
ax.set_title("Top 10 revenue Movies",fontsize = 15)
ax.set_xlabel("revenue",fontsize = 13)

## Results

* Above are the top 10 movies by revenue.
* Avatar and Star Wars are among the highest-revenue movies.
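
As an alternative to sorting the whole table, `DataFrame.nlargest` selects the top rows directly. A sketch on made-up data (the titles and revenue values are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical sample of movies and their revenue
sample = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "revenue": [300, 900, 100, 600],
})

# Pick the 3 rows with the largest revenue, returned in descending order
top3 = sample.nlargest(3, "revenue")[["original_title", "revenue"]]
print(top3)
```

On the real data the same call with `10` in place of `3` would give the table behind the plot for this question.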

### Question 2: Which 5 movies are the most popular?

In [24]:
# Check the highest popularity value
df['popularity'].max()

In [25]:
# Build a table of popularity by movie title and keep the five most popular movies
top_five_movies_by_pop=df.pivot_table(values='popularity',index="original_title").sort_values(ascending=False,by='popularity').head(5)
top_five_movies_by_pop

In [26]:
top_five_movies_by_pop.popularity.plot(kind='bar',fontsize=11,edgecolor='black')
plot.title('Top 5 Movies by popularity',weight='bold')
plot.ylabel('popularity',fontsize=13,weight='bold')
plot.xlabel('movies_title',fontsize=13,weight='bold');

## Results

* Above are the top 5 movies by popularity.
* After analysis and visualization, I found that Jurassic World has the highest popularity score of all movies (about 32.99).
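
Note that `pivot_table` aggregates with the mean by default, so movies that share a title are averaged into a single row. A small sketch with made-up values showing how this differs from sorting the rows directly:

```python
import pandas as pd

# Hypothetical sample: two different movies share the title 'Remake'
sample = pd.DataFrame({
    "original_title": ["Remake", "Remake", "Solo"],
    "popularity": [10.0, 2.0, 7.0],
})

# pivot_table groups by title and averages: 'Remake' becomes (10 + 2) / 2 = 6
by_title = sample.pivot_table(values="popularity", index="original_title")
print(by_title.sort_values("popularity", ascending=False))

# Sorting the raw rows keeps each movie separate, so 10.0 stays on top
by_row = sample.sort_values("popularity", ascending=False)
print(by_row.head(1))
```

If titles repeat in the real data, the two approaches can rank movies differently, so it is worth checking which one matches the question being asked.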

### Question 3: How have film budgets changed over time, from 1960 to 2015?

In [27]:
# Build a pivot table of the mean budget for each release year
year_budget=df.pivot_table(values='budget',index='release_year')
year_budget

In [28]:
year_budget.plot(kind='line', fontsize=12)
plot.xlabel('release_year',weight='bold')
plot.ylabel('budget',weight='bold')
plot.title('Average film budget by release year',size = 10,weight='bold')
plot.show()

## Results

* The plot above shows the relationship between release year and average budget.
* In the 1960s and 1970s, budgets were very small compared with later decades. Likely reasons include that film production, from the filmmaking itself to the music, was less developed than it is today.
* In more recent years, for example 2002, 2007, 2010, 2011, and 2015, budgets are much higher, plausibly because of capabilities that did not exist before, such as modern technology.
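
The pivot table used for this question averages the budget per release year. The same series can be obtained with `groupby`, and the change between two years can then be quantified. A sketch on made-up numbers (not values from the dataset):

```python
import pandas as pd

# Hypothetical sample with two movies in each of two years
sample = pd.DataFrame({
    "release_year": [1960, 1960, 2015, 2015],
    "budget": [1_000_000, 3_000_000, 40_000_000, 60_000_000],
})

# Mean budget per release year, equivalent to pivot_table's default aggregation
mean_budget = sample.groupby("release_year")["budget"].mean()
print(mean_budget)

# Ratio between the latest and the earliest yearly mean
growth = mean_budget.loc[2015] / mean_budget.loc[1960]
print(f"Mean budget grew {growth:.0f}x between 1960 and 2015")
```

A ratio like this gives a concrete figure to accompany the line plot instead of relying on visual impression alone.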


<a id='conclusions'></a>
## Conclusions

* Film budgets in the past were small compared with today.
* Jurassic World has the highest popularity score in the dataset (about 32.99).
* Avatar and Star Wars are among the highest-revenue movies.



#### Limitations
* The dataset contains erroneous values: many movies have a budget or revenue of zero. These values strongly affect the analysis, so I had to drop many rows because of them.
* Some columns had missing values and there were duplicate rows, which I removed. I also dropped the columns I did not need for my analysis.
* I renamed one column (`runtime` to `Time`) because I found the original name unsuitable.
* Since the raw dataset has more than 10,000 data points, I avoided scatter plots, which could suffer from over-plotting.