

# Project: Characteristics of the Most Profitable Movies

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusion">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> Movies play a big role in entertainment, so competition between production companies is high. In this document I analyze a Kaggle dataset containing about 10,000 movie records to answer the question "What are the characteristics of a profitable movie?". To do so, I clean, explore, and analyze the data, then summarize the results.

In [None]:
# Import the packages used throughout the analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Render plots inline in the notebook.
%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling



### General Properties

In [None]:
# Load the data and preview the first few rows.
df = pd.read_csv('movies.csv')
pd.set_option('display.max_columns', None) # to show all the columns
df.head()

<li>Columns to drop: id, imdb_id, popularity, homepage, tagline, keywords, overview, vote_count, vote_average, budget_adj, revenue_adj</li>

In [None]:
df.shape

In [None]:
df.columns # to see the column names

<li>Checking whether the column names follow best practice.</li>

In [None]:
# Drop the columns that are not needed for this analysis.
del_col = ['id','imdb_id','popularity', 'homepage', 'tagline', 'keywords', 'overview', 'vote_count', 'vote_average', 'budget_adj', 'revenue_adj']
df.drop(columns=del_col, inplace=True)
df.head()

In [None]:
df.shape

<li>The columns are dropped</li>

In [None]:
df.duplicated().sum()

In [None]:
df.loc[df.duplicated(),:]

<li>I used loc to show me all the duplicated rows</li>
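By default, `duplicated()` flags only the later copies of a repeated row, so the first occurrence is not shown. A minimal sketch of the difference, using a small hypothetical `toy` DataFrame (not the movies data):

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "title": ["A", "B", "A"],
    "year": [2000, 2001, 2000],
})

# Default keep='first': only the later copy (row 2) is flagged.
later_copies = toy[toy.duplicated()]

# keep=False flags every member of a duplicated group (rows 0 and 2).
all_copies = toy[toy.duplicated(keep=False)]
```

Viewing all copies side by side is a quick way to confirm that the rows really are identical before dropping one of them.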

In [None]:
df.info()

<li>Checking for nulls and inspecting the data types as well.</li>

In [None]:
df.isnull().sum()

<li>I found the nulls and I'll drop them, because I can't replace them.</li>

In [None]:
df.dtypes

In [None]:
df.nunique()

<li>The release_date column should be converted to a datetime type.</li>

In [None]:
zero_budget = df[df.budget == 0]
zero_budget.shape

In [None]:
zero_revenue = df[df.revenue == 0]
zero_revenue.shape

<li>There are 5696 rows with a budget of 0 and 6016 rows with a revenue of 0. I will delete every row with a zero revenue, check whether any zero budgets remain, and then delete those as well.</li>

<li>#1 Remove the duplicates</li>
<li>#2 Delete the null values and the zero values in budget and revenue</li>
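The two zero counts can also be computed in one pass, together with the number of rows that would be lost if a zero in either column disqualifies a row. This is only a sketch on a hypothetical `toy` DataFrame, not the movies data:

```python
import pandas as pd

# Hypothetical toy data standing in for the budget and revenue columns.
toy = pd.DataFrame({
    "budget":  [0, 100, 0, 250],
    "revenue": [0, 300, 50, 0],
})

is_zero = toy[["budget", "revenue"]] == 0

# Number of zeros in each column.
zero_counts = is_zero.sum()

# Number of rows with a zero in at least one of the two columns.
rows_with_any_zero = is_zero.any(axis=1).sum()
```

Because the zero rows in the two columns overlap, `rows_with_any_zero` is usually smaller than the sum of the per-column counts.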

### Removing the Duplicates, Null, and Zero values (Data Cleaning)

In [None]:
df.drop_duplicates(inplace=True)
df.shape

<li>Here I dropped the duplicated row and verified it by checking the shape: the number of rows decreased by 1.</li>

In [None]:
zero_col = ["budget","revenue"]
df[zero_col] = df[zero_col].replace(0 ,np.nan)

<li>Here I replaced every zero in budget and revenue with NaN so that those rows are dropped together with the other null values.</li>

In [None]:
df.dropna(inplace=True)
df.shape

<li>Here I dropped the rows with null values; inplace=True applies the change to df directly.</li>

In [None]:
df.isnull().sum()

<li>Checking again after dropping the rows with null values: there are no nulls left.</li>

In [None]:
df.head()

In [None]:
df['profit'] = df.revenue - df.budget

In [None]:
df.head(1)

<li>Added a new column to display the profit of each movie</li>

In [None]:
df.dtypes

<li>The dtypes have changed: replacing the zeros with NaN converted the integer columns to float, so I will convert them back to integer now that the cleaning is done.</li>
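The dtype change comes from NaN itself: NaN is a floating-point value, so an integer column that receives one is upcast to float. A minimal sketch with a hypothetical Series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical budget values; the first one is a zero placeholder.
budget = pd.Series([0, 100, 250])
before = budget.dtype                 # integer dtype

# Replacing 0 with NaN forces the column to float.
with_nan = budget.replace(0, np.nan)
after_replace = with_nan.dtype        # float dtype

# Once the NaN rows are dropped, the values can be cast back to int.
restored = with_nan.dropna().astype(int)
```

The cast back to `int` only works after the NaN rows are gone, which is why the conversion comes after `dropna()`.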

In [None]:
df['budget'] = df['budget'].astype(int)
df['revenue'] = df['revenue'].astype(int)
df['profit'] = df['profit'].astype(int)
df['release_date'] = pd.to_datetime(df['release_date'])

In [None]:
df.dtypes

<li>Now budget, revenue, and profit are integers, and release_date is a datetime.</li>

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
pd.set_option('display.float_format', lambda x: '%8.2f' % x)

<li>This option sets the display format for floats to two decimal places, which fixes the hard-to-read numbers in the .describe() output.</li>

In [None]:
df.describe()

<li>Here I show the descriptive statistics of the numerical variables in order to identify the most profitable movies and focus the study on them. I will keep the movies whose profit is at or above the 75th percentile.</li>

In [None]:
index_profit = df[df['profit'] < 83473333.00].index
df.drop(index_profit, inplace=True)

In [None]:
df.shape

<li>After dropping everything below the 75th percentile, I have 952 movies left to study the characteristics of the most profitable movies.</li>
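The threshold can also be computed with `quantile` instead of being copied by hand from the `.describe()` output. The sketch below uses a hypothetical `toy` DataFrame with made-up profits, so the numbers are not those of the movies data:

```python
import pandas as pd

# Hypothetical profits for eight movies.
toy = pd.DataFrame({"profit": [10, 20, 30, 40, 50, 60, 70, 80]})

# 75th percentile of profit (pandas interpolates linearly by default).
threshold = toy["profit"].quantile(0.75)

# Keep only the movies at or above the threshold.
top = toy[toy["profit"] >= threshold]
```

Computing the value this way keeps the filter correct if the cleaning steps change and the percentile shifts.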

In [None]:
df.hist(column='revenue');

<li>This histogram shows that revenue is skewed to the right: most movies sit at the lower end of the range, while a small number of very high-revenue movies form a long tail and typically pull the mean above the median.</li>
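Right skew can also be checked numerically by comparing the mean with the median and looking at the sample skewness. A small sketch on a hypothetical Series (made-up values, not the movies data):

```python
import pandas as pd

# Hypothetical revenues: mostly small values plus one very large one.
revenue = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])

mean = revenue.mean()
median = revenue.median()

# Positive skewness indicates a longer tail on the right.
skewness = revenue.skew()
```

A mean well above the median together with a positive skewness is consistent with what a right-skewed histogram shows.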

In [None]:
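df.select_dtypes(include='number').corr()  # correlations between the numeric columns only

This shows the correlation between all the numeric columns.
I found that:
<li>The budget and revenue have a positive correlation.</li>
<li>The revenue and profit have a very strong positive correlation.</li>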
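A single pair of columns can be checked directly with `Series.corr`, which returns the Pearson coefficient by default. The sketch below uses a hypothetical `toy` DataFrame; note that profit is computed from revenue, so part of their correlation follows from that definition:

```python
import pandas as pd

# Hypothetical budgets and revenues for four movies.
toy = pd.DataFrame({
    "budget":  [1, 2, 3, 4],
    "revenue": [2, 4, 6, 9],
})
toy["profit"] = toy["revenue"] - toy["budget"]

# Pearson correlation for one pair of columns at a time.
r_budget_revenue = toy["budget"].corr(toy["revenue"])
r_revenue_profit = toy["revenue"].corr(toy["profit"])
```

Both coefficients lie between -1 and 1; values close to 1 indicate a strong positive linear relationship.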

<a id='eda'></a>
## Exploratory Data Analysis

### 1- The Relationship between Revenue and the Budget

In [None]:
plt.figure(figsize=(10,6))

plt.title("The Correlation Between The Revenue and Budget")

sns.scatterplot(x=df['budget'], y=df['revenue'])
plt.ylabel("Revenue")
plt.xlabel("Budget");

<li>This scatter plot shows that the relationship between the 'revenue' and 'budget' columns is a weak positive one. That doesn't mean the budget has no effect on the revenue; the budget does have some effect on it.</li>

### 2- The Relationship Between the Revenue and the Profit

In [None]:
plt.figure(figsize=(10,6))

plt.title("The Correlation Between The Revenue and Profit")

sns.scatterplot(x=df['profit'], y=df['revenue'])
plt.ylabel("Revenue")
plt.xlabel("Profit");

<li>The scatter plot above shows that the relationship between the 'revenue' and 'profit' columns is a strong positive one.</li>

### 3- Average Budget

In [None]:
# Average budget for a successful movie
df.budget.mean()

### 4- Average Runtime

In [None]:
# Average runtime for a successful movie
df.runtime.mean()

### 5- The Best Cast to Have

In [None]:
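def separated_col(column):
    """Count how often each '|'-separated value appears in a column of df."""
    # Join every row of the column into one long '|'-separated string.
    data = df[column].str.cat(sep='|')
    # Split the string back into individual values.
    data = pd.Series(data.split('|'))
    # Count each value, most frequent first.
    count = data.value_counts(ascending=False)
    return count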

<li>I created a function that separates the values in columns delimited by "|" without modifying the data. It avoids repeating the same lines of code and counts the occurrences of each unique value in any column I select, as used in the cells below.</li>
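The same counts can be obtained without building one long string, by splitting each row into a list and expanding the lists with `explode`. This is an alternative sketch, not the function used in this notebook, and the `toy` DataFrame is hypothetical:

```python
import pandas as pd

# Hypothetical genres column, including one missing value.
toy = pd.DataFrame({
    "genres": ["Action|Comedy", "Comedy", None, "Drama|Comedy"],
})

# Split each row on '|', give every value its own row, then count.
counts = toy["genres"].str.split("|").explode().value_counts()
```

Missing values stay as NaN through `split` and `explode`, and `value_counts` leaves them out by default.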

In [None]:
count = separated_col('cast')
count.head(10)

In [None]:
count = count.head(10)
count.plot(kind='barh', figsize=(10, 8))
plt.title('The frequency of the Cast')
plt.ylabel('Cast Name')
plt.xlabel('Frequency');

### 6- The Best Director to Have

In [None]:
count = df.director.value_counts()
count.head(10)

In [None]:
count = count.head(10)
count.plot(kind='pie', figsize=(13, 14));
plt.title('Director')
plt.ylabel('')
plt.xlabel('');

### 7- The Best Genres

In [None]:
count = separated_col('genres')
count.head(10)

In [None]:
count = count.head(10)
count.plot(kind='pie', figsize=(13, 14))
plt.title('Genres')
plt.ylabel('')
plt.xlabel('');

### 8- The Best Production Companies

In [None]:
count = separated_col('production_companies')
count.head(10)

In [None]:
count = count.head(10)
count.plot(kind='barh', figsize=(12, 10))
plt.title('The frequency of the Production Companies')
plt.ylabel('Production Company')
plt.xlabel('Frequency');

<a id='conclusion'></a>
## Conclusions

> In conclusion, I found that revenue and budget have a weak positive relationship, which means that a higher budget does not always lead to a higher revenue; other factors also affect the revenue and profit.

> To have a profitable movie:
><li>The average budget would be around: <b>$70,032,738.56</b>.</li>
><li>The average runtime would be around: <b>115 minutes</b>.</li>
><li>Have <b>Tom Cruise</b> in your main cast.</li>
><li>Have <b>Steven Spielberg</b> as the director.</li>
><li>Make your movie around these three genres: <b>Comedy</b>, <b>Action</b>, <b>Drama</b>.</li>
><li>Have either <b>Universal Pictures</b> or <b>Warner Bros</b> as your production company.</li>

><b>Limitations:</b> The 'budget' and 'revenue' columns contain a large number of zero values, so I was forced to delete almost half of the data, which is a lot and affected my analysis and results.