## 1. Introduction 

In this project we work in Python to first explore and analyze our data with a variety of functions from the pandas library, and then find the correlations between variables.
For that we'll use the Movie Industry dataset available on Kaggle, containing 6820 movies (220 movies per year, 1986-2016). Each movie has the following attributes:

Numerical columns: Budget, Gross, Runtime, Score and Votes.\
Categorical columns: Company, Country, Director, Genre, Name, Rating, Star and Writer.\
Date columns: Released and Year.

## 2. Importing Libraries 

In [None]:
# Import the libraries used throughout the notebook:
# pandas, numpy, seaborn and matplotlib.
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
plt.style.use('ggplot')
from matplotlib.pyplot import figure

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,8)

pd.options.mode.chained_assignment = None

## 3. Data Collection

**Loading and Reading the dataset**


In [None]:
# Now we need to read in the data

df = pd.read_csv('../input/movies/movies.csv', encoding = "ISO-8859-1")
df.head()

In [None]:
# Summary statistics for the numerical columns.
df.describe().T

- **The dataset has 6820 titles.**
- **The period studied runs from 1986 to 2016.**

## 4. Checking Data

**Is there any missing data?**

In [None]:
# Let's loop through the data and see if there is anything missing
 
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('%-15s  %-15s' %(col,pct_missing))
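
The same per-column missing fraction can be computed without an explicit loop. Below is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movies data (values are made up).
toy = pd.DataFrame({
    "budget": [1.0, np.nan, 3.0, 4.0],
    "gross": [10.0, 20.0, 30.0, 40.0],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing values.
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

On the real data the equivalent call would be `df.isnull().mean()`.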

In [None]:
sns.heatmap(df.isnull(), cbar=False) 
plt.title('Missing Values ', fontsize = 14)
#plt.rcParams['figure.figsize'] = (5,3)
plt.show()

**There are no missing values in the dataset.**

In [None]:
# Data types of our columns

df.dtypes

**The values in the "Budget" and "Gross" columns are stored as floats, but they have no digits after the decimal point.**

## 5. Data Cleaning

**5.1 The values in the "Budget" and "Gross" columns are floats with no digits after the decimal point, so we will convert them to integers.**



In [None]:
# Change the data type of the columns

df['budget']=df['budget'].astype('int64')

df['gross'] = df['gross'].astype('int64')

df.head()

**5.2 The year column has some incorrect values when compared with the released date column**

In [None]:
# Create a corrected year column from the first four characters of the released column
df['YearCorrect'] = df['released'].astype(str).str[:4]
df.dtypes
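
Slicing the first four characters only works if every `released` value starts with the year, and it leaves the new column as text. As a hedged alternative, the date can be parsed and the year taken from it. The sketch below uses a made-up frame and assumes ISO-style date strings, which may not match every version of the dataset.

```python
import pandas as pd

# Toy rows (made up): one matching year, one mismatch, one unparseable date.
toy = pd.DataFrame({
    "released": ["1986-08-22", "1987-01-09", "not a date"],
    "year": [1986, 1986, 1990],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(toy["released"], errors="coerce")

# The nullable integer dtype keeps missing years as <NA>.
toy["YearCorrect"] = parsed.dt.year.astype("Int64")

# Rows where the original year disagrees with the parsed release year.
differs = (toy["YearCorrect"] != toy["year"]).fillna(False)
mismatch = toy[differs]
print(mismatch)
```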

In [None]:
# Uncomment the next line to display all rows when printing a DataFrame

#pd.set_option('display.max_rows',None)

In [None]:
#Viewing the top movies with highest gross 

df = df.sort_values(by=['gross'],ascending=False)
df.head()

**5.3 Checking for duplicates**

In [None]:
#Checking the existence of duplicated rows
df.duplicated().sum()

#Drop any duplicates
#df.drop_duplicates()
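
Note that `drop_duplicates()` returns a new DataFrame rather than modifying the original, so the result has to be assigned to take effect. A small sketch with made-up rows:

```python
import pandas as pd

# Toy frame (made up) with one fully duplicated row.
toy = pd.DataFrame({
    "name": ["A", "A", "B"],
    "gross": [100, 100, 250],
})

n_duplicates = toy.duplicated().sum()  # counts rows identical to an earlier row
print(n_duplicates)

# Assign the result; toy itself is left unchanged.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```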

## 6. Data Analysis

**6.1 Company analysis**

In [None]:
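# Select several columns with a list (double brackets); tuple-style selection was removed in pandas 2.0
comp = df.groupby(['name','company'])[['budget','gross']].sum().sort_values(by='gross',ascending=False)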
comp.head(10)

In [None]:
df.groupby('company').size().plot(kind = "bar")

In [None]:
topcom = comp.reset_index()
topcom.head()

In [None]:
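# Count movies per company and keep the ten most prolific ones.
# The column is named explicitly because the default name differs across pandas versions.
company = df['company'].value_counts().rename('company').to_frame()
company = company.head(10)
company.head(3)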



In [None]:
sns.barplot(x = company.index, y = company['company'])

labels = company.index.tolist()
plt.gcf().set_size_inches(15, 7)

plt.title('Company vs. Movies released', fontsize = 15)
plt.xlabel('Company', fontsize = 15)
plt.ylabel('Released movies', fontsize = 15)
#plt.rcParams['figure.figsize'] = (1,2)
plt.xticks(ticks = range(len(labels)), labels = labels, rotation = 45)

plt.show()

In [None]:
Perc = company.sum() / df.shape[0] * 100
Perc

**6.2 Genre and Rating Analysis**

In [None]:
plt.figure(figsize = (12,10))
sns.countplot(x = 'rating',data = df ,hue='genre')
plt.legend(loc='upper center')
plt.show()

In [None]:
df['rating'].value_counts().plot.bar()


In [None]:
# Split the data into one DataFrame per genre
df1 = {genre: group for genre, group in df.groupby('genre')}
df1.keys()
action, musical, comedy, family = df1['Action'], df1['Musical'], df1['Comedy'], df1['Family']

In [None]:
action.head(10).sort_values(by = ['gross', 'budget'], ascending=False)

In [None]:
comedy.head(10).sort_values(by = ['gross', 'budget'], ascending=False)

In [None]:
family[(family.rating == 'G') | (family.rating == 'PG')].sort_values(by = ['rating', 'score'], ascending = [True, False]) 

- **We conclude that most of the movies are rated R or PG-13, and that most movies belong to the Adventure, Action and Comedy genres.**
- **G-rated movies are mostly family ones.**
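
These observations can be checked numerically with a cross-tabulation of rating against genre. The sketch below uses a made-up frame; on the real data the same call would be applied to `df['rating']` and `df['genre']`.

```python
import pandas as pd

# Toy frame (made up) with a few ratings and genres.
toy = pd.DataFrame({
    "rating": ["G", "G", "G", "R", "R", "PG-13"],
    "genre": ["Family", "Family", "Animation", "Action", "Comedy", "Action"],
})

# normalize="index" turns counts into the share of each genre within a rating.
share = pd.crosstab(toy["rating"], toy["genre"], normalize="index")
print(share.loc["G"].sort_values(ascending=False))
```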

## 7. Correlation Analysis

**My Guesses**
- I think **Budget** will have a high correlation with gross, because the more money a studio spends, the more it is likely to earn.
- Also the **company**, because bigger companies like Disney make movies that bring in a lot of money.

In [None]:
#let's start looking at correlation
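# numeric_only=True (pandas >= 1.5) skips the text columns, which would otherwise raise an error in pandas 2.x
df.corr(method='pearson', numeric_only=True)  # pearson, kendall, spearman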

In [None]:

# Set the figure size before drawing so that it applies to this plot
plt.figure(figsize=(10,10))
correlation_matrix = df.corr(method='pearson', numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Matrix for Numeric Features')
plt.xlabel('Movie Features')
plt.ylabel('Movie Features')
plt.show()

**So I was right! There is a clear and high correlation between 'Budget' and 'Gross'.**

In [None]:
#looks at company

df.head(10)

In [None]:
# Using factorize - this assigns an integer code to each unique value in a column (in order of first appearance, not randomly)

df.apply(lambda x: x.factorize()[0]).corr(method='pearson')

In [None]:
correlation_matrix = df.apply(lambda x: x.factorize()[0]).corr(method='pearson')

sns.heatmap(correlation_matrix, annot = True)

plt.title("Correlation matrix for Movies")

plt.xlabel("Movie features")

plt.ylabel("Movie features")

plt.show()
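
One caveat: applying `factorize` to every column also replaces the numeric columns with integer codes, which changes their correlations. A hedged alternative is to encode only the text columns and leave the numbers as they are. The toy frame and the helper name `encode_categoricals` below are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def encode_categoricals(frame):
    """Return a copy where non-numeric columns are replaced by integer codes."""
    encoded = frame.copy()
    for col in encoded.select_dtypes(exclude="number").columns:
        # factorize numbers the categories in order of first appearance
        encoded[col] = pd.factorize(encoded[col])[0]
    return encoded

# Toy frame (made up): one text column and two numeric columns.
toy = pd.DataFrame({
    "company": ["X", "Y", "X", "Z"],
    "budget": [10.0, 20.0, 30.0, 40.0],
    "gross": [15.0, 45.0, 50.0, 90.0],
})

encoded = encode_categoricals(toy)
print(encoded.corr(method="pearson"))
```

Even then, the codes are arbitrary labels, so a Pearson correlation involving an encoded column should be read with caution.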

In [None]:
correlation_mat = df.apply(lambda x: x.factorize()[0]).corr()

corr_pairs = correlation_mat.unstack()

corr_pairs.head(100)

In [None]:
sorted_pairs = corr_pairs.sort_values(kind="quicksort")

sorted_pairs.head(100)

In [None]:
# We can now take a look at the pairs that have a high correlation (absolute value > 0.7)

strong_pairs = sorted_pairs[abs(sorted_pairs) > 0.7]

strong_pairs.head(30)
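
The unstacked matrix lists every pair twice and includes each column paired with itself (correlation 1.0), so the strong pairs are easier to read once those entries are filtered out. A minimal sketch on made-up numbers, keeping only the upper triangle of the matrix:

```python
import numpy as np
import pandas as pd

# Toy frame (made up): gross tracks budget closely, score does not.
toy = pd.DataFrame({
    "budget": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gross": [2.0, 4.1, 6.2, 7.9, 10.0],
    "score": [5.0, 3.0, 6.0, 2.0, 4.0],
})
corr = toy.corr()

# Keep only the cells above the diagonal (k=1), so each pair appears once
# and self-correlations are excluded; the other cells become NaN.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = upper.unstack().dropna()
strong = pairs[pairs.abs() > 0.7].sort_values(ascending=False)
print(strong)
```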

- **Votes and budget have the highest correlation to gross earnings.**
- **I was wrong! Company has a low correlation.**


In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey = True)

plt.gcf().set_size_inches(15, 7)
ax1.scatter(df.budget, df.gross, c = 'pink')
ax1.set_title('Budget vs. Gross', c = 'pink', fontsize = 25)
ax2.scatter(df.votes, df.gross, c='blue')
ax2.set_title('Votes vs. Gross', c ='blue', fontsize = 25)

# Label the shared y-axis on the left-hand plot
ax1.set_ylabel('Gross', fontsize = 25)

plt.show()

- **Movies with low budgets and few votes tend to have low gross earnings.**
- **As the budget rises, gross earnings tend to grow, seemingly at an exponential rate.**

#### Thanks for reaching the end! Upvote if you liked it!
