## Movie analysis

Hi! I'm new to ML so I'd be glad to receive your feedback!!

**Importing modules**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import seaborn as sns
import geopandas as gpd
from wordcloud import WordCloud

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
sns.set()

%matplotlib inline

**Loading and getting to know the dataset**

In [None]:
dataset = pd.read_csv('../input/movies/movies.csv', encoding = "ISO-8859-1")
dataset.head()

**The set has:**

   - Numerical columns: Budget, Gross, Runtime, Score and Votes.
   - Categorical columns: Company, Country, Director, Genre, Name, Rating, Star and Writer.
   - Date columns: Released and Year.
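
The split between numerical and categorical columns can also be read from the dtypes instead of by eye. A minimal sketch on a made-up frame (the `toy` frame and its values are hypothetical); on the real data the same calls would run on `dataset`, where a column stored as a number, such as the year, would land in the numeric list.

```python
import pandas as pd

# Hypothetical toy frame with a few of the columns listed above
toy = pd.DataFrame({
    'name': ['A', 'B'],
    'genre': ['Action', 'Comedy'],
    'budget': [1.0e6, 2.5e6],
    'runtime': [100, 95],
})

# Columns with a numeric dtype vs. everything else
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```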

**Missing values?**

In [None]:
sns.heatmap(dataset.isnull(), cbar=False) 
plt.title('Missing values by column and position', fontsize = 15)
plt.show()

*There are no missing values in the set*
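
A heatmap can hide a handful of missing cells in a frame with thousands of rows, so a numeric count is a useful cross-check. A minimal sketch on a made-up frame (`toy` and its values are hypothetical, not taken from the dataset); on the real data the equivalent call would be `dataset.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'budget': [1.0, np.nan, 3.0],
    'score': [6.1, 7.2, 5.5],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only if the whole frame has no missing values
has_no_missing = not toy.isnull().values.any()
print(has_no_missing)
```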

**Numerical features description**

In [None]:
dataset.describe().T

Some conclusions:
- The set has 6820 titles.
- The period covered runs from 1986 to 2016.
- The average film duration is 1h 46min.
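
The "1h 46min" figure is just the mean runtime in minutes split into hours and minutes. A small sketch of that arithmetic, assuming a rounded mean of 106 minutes (the value implied by the figure above; on the real data it would come from `dataset['runtime'].mean()`):

```python
# Assumed rounded mean runtime in minutes (implied by "1h 46min")
mean_runtime_min = 106

# divmod returns the whole hours and the remaining minutes
hours, minutes = divmod(mean_runtime_min, 60)
print(f'{hours}h {minutes}min')
```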

**How many films are there in the set per year?**

In [None]:
sns.distplot(dataset['year'], bins = 5, color = 'orange', label = 'KDE')
plt.legend()
plt.gcf().set_size_inches(12, 5)

*It seems the set contains roughly the same number of movies for each year!*
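
With only 5 bins spread over about 30 years, the histogram averages out year-to-year differences, so counting rows per year is a more direct check. A minimal sketch on made-up data (the `toy` frame is hypothetical); on the real data the same call would be `dataset['year'].value_counts().sort_index()`.

```python
import pandas as pd

# Hypothetical toy data: one row per movie
toy = pd.DataFrame({'year': [1986, 1986, 1987, 1987, 1988, 1988]})

# Movies per year, ordered chronologically
per_year = toy['year'].value_counts().sort_index()
print(per_year)

# If every year has the same count, the counts hold a single distinct value
same_every_year = per_year.nunique() == 1
print(same_every_year)
```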

**Seeing the oldest and newest movies in the set**

*Oldest released movies*

In [None]:
Oldest = dataset.sort_values("released", ascending = True)
Oldest[['name', "released"]][:10]

*Newest released movies*

In [None]:
Newest = dataset.sort_values("released", ascending = False)
Newest[['name', "released"]][:10]

**The 10 countries with the most released films**

In [None]:
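# Load the low-resolution world map bundled with geopandas
# (gpd.datasets was removed in geopandas 1.0, so this call needs an older version)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))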
world.head(3)

**Country names in my dataset**

In [None]:
country_geo = list(world['name']) # Countries in 'naturalearth_lowres' 
country_data = list(dataset['country'].unique()) # Countries in my dataset

country_diff = [country for country in country_data if country not in country_geo]
country_diff # Countries with different names

In [None]:
# Rename countries so they match the names used in 'naturalearth_lowres'
dataset['country'] = dataset['country'].replace(
    {'USA':'United States of America','UK':'United Kingdom',
     'West Germany':'Germany', 'Hong Kong':'China',
     'Soviet Union': 'Russia', 'Czech Republic':'Czech Rep.'})
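
After renaming, it is worth checking again that no country name is left without a match on the map. A minimal sketch with made-up stand-ins (`map_names` and `countries` are hypothetical substitutes for `world['name']` and `dataset['country']`):

```python
import pandas as pd

# Hypothetical stand-ins for world['name'] and dataset['country']
map_names = ['United States of America', 'United Kingdom', 'France']
countries = pd.Series(['USA', 'UK', 'France', 'USA'])

# Same kind of renaming as in the cell above, on the toy data
renamed = countries.replace({'USA': 'United States of America',
                             'UK': 'United Kingdom'})

# Names still missing from the map after renaming (empty means all matched)
unmatched = sorted(set(renamed) - set(map_names))
print(unmatched)
```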

**Countries with the most released films**

In [None]:
# Movies per country; the counts column is named explicitly so the code
# behaves the same across pandas versions
Countries = dataset['country'].value_counts().to_frame('country')
Ten_countries = Countries.head(10)

sns.barplot(x = Ten_countries.index, y = Ten_countries['country'])

labels = Ten_countries.index.tolist()
plt.gcf().set_size_inches(15, 7)

plt.title('Countries vs movies released', fontsize = 20)
plt.xlabel('Country', fontsize = 15)
plt.ylabel('Movies released', fontsize = 15)

plt.xticks(ticks = [0,1,2,3,4,5,6,7,8,9] , labels = labels, rotation = 45)
plt.show()

**Geographic plot**

In [None]:
# Turn the counts into a two-column frame: country name and number of movies
Temp3 = Countries.reset_index()
Temp3.columns = ['countries', 'Total_movies']

In [None]:
mapped = world.set_index('name').join(Temp3.set_index('countries')).reset_index()

to_be_mapped = 'Total_movies'
vmin, vmax = 0, 4900  # color scale limits, shared by the map and the colorbar
fig, ax = plt.subplots(1, figsize=(15,15))

mapped.dropna().plot(column=to_be_mapped, cmap='Blues', linewidth=0.9, ax=ax,
                     edgecolors='0.6', vmin=vmin, vmax=vmax)
ax.set_title('Movies per country', fontdict={'fontsize':20})
ax.set_axis_off()

# Build a colorbar that uses the same colormap and limits as the map
sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin = vmin, vmax = vmax))
sm.set_array([])

cbar = fig.colorbar(sm, ax=ax, orientation='horizontal')

In [None]:
# Share of all movies that come from the 10 countries with the most releases
Per_country = Ten_countries.sum() / dataset.shape[0] * 100
Per_country

- The 10 countries with the most released films account for 94.7% of all the films released in those 30 years.
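
The share of the most frequent values can be wrapped in a small helper, which works for any categorical column (country, company, and so on). A minimal sketch, where `top_n_share` is a hypothetical helper name and the toy column is made up:

```python
import pandas as pd

def top_n_share(values, n=10):
    """Return the percentage of rows covered by the n most frequent values."""
    counts = values.value_counts()
    return counts.head(n).sum() / len(values) * 100

# Hypothetical toy column: 6 of 8 rows fall in the two most common countries
toy = pd.Series(['US', 'US', 'US', 'US', 'UK', 'UK', 'FR', 'DE'])
share = top_n_share(toy, n=2)
print(share)
```

On the real data the call would look like `top_n_share(dataset['country'], n=10)`.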

**Company analysis**

In [None]:
dataset.groupby('company').size()

*There are 2179 different companies*

In [None]:
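# Ten companies with the most movies; the counts column is named explicitly
# so the code behaves the same across pandas versions
company = dataset['company'].value_counts().head(10).to_frame('company')
company.head(3)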

In [None]:
sns.barplot(x = company.index, y = company['company'])

labels = company.index.tolist()
plt.gcf().set_size_inches(15, 7)

plt.title('Company vs. Movies released', fontsize = 20)
plt.xlabel('Company', fontsize = 15)
plt.ylabel('Released movies', fontsize = 15)
plt.xticks(ticks = [0,1,2,3,4,5,6,7,8,9] , labels = labels, rotation = 45)
plt.show()

In [None]:
Porcentaje = company.sum() / dataset.shape[0] * 100
Porcentaje

**Conclusion:**

- The 10 companies with the most releases account for 27% of all the movies released in those 30 years.

**Genre and rating**

In [None]:
dataset['rating'].value_counts().plot.pie(autopct='%1.1f%%',shadow=True,figsize=(10,8))
plt.title('Rating percentages', fontsize = 20)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize = (22,10))
sns.countplot(x = 'rating',data = dataset ,hue='genre')
plt.legend(loc='upper center')
plt.show()

*Let's see some Adventure films from the USA*

In [None]:
tag = "Adventure"
small = dataset[dataset["genre"] == tag]
small[small["country"] == "United States of America"][["name", "country","year"]].head(10)

**Conclusion:**

   - We can see that most of the movies are rated R or PG-13, and that most belong to the Adventure, Action and Comedy genres.
   - G-rated movies are mostly family ones! (as expected!)
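
The count plot can be backed up with a table of genre shares within each rating. A minimal sketch on made-up data (the `toy` frame and its genre labels are hypothetical); on the real data the same call would use `dataset['rating']` and `dataset['genre']`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'rating': ['G', 'G', 'G', 'R', 'R', 'PG-13'],
    'genre': ['Family', 'Family', 'Animation', 'Action', 'Comedy', 'Action'],
})

# Share of each genre within each rating (every row sums to 1)
genre_by_rating = pd.crosstab(toy['rating'], toy['genre'], normalize='index')
print(genre_by_rating.round(2))

# Most common genre for each rating
top_genre = genre_by_rating.idxmax(axis=1)
print(top_genre)
```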

**Correlation analysis**

In [None]:
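# Correlate only the numeric columns (text columns cannot be correlated)
sns.heatmap(dataset.select_dtypes(include='number').corr(), annot = True, linewidths=.5, cmap='cubehelix')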
plt.title('Correlation', fontsize = 20)
plt.gcf().set_size_inches(15, 7)
plt.show()

*There is a clear correlation between 'Budget' and 'Gross', and also a relationship between 'Votes' and 'Gross'*

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey = True)

plt.gcf().set_size_inches(15, 7)
ax1.scatter(dataset.budget, dataset.gross, c = 'green')
ax1.set_title('Budget vs. Gross', c = 'green', fontsize = 25)
ax2.scatter(dataset.votes, dataset.gross, c='red')
ax2.set_title('Votes vs. Gross', c ='red', fontsize = 25)

ax1.set_ylabel('Gross', fontsize = 25)  # label the shared y-axis on the left plot

plt.show()

**Conclusion:**

- Low-budget movies and movies with few votes all seem to have a low gross.
- As the budget rises, gross tends to improve, in what looks like an exponential trend.
- There is no clear relation between how much a movie grosses and the number of votes it has.
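
Readings taken from scatter plots are easier to compare when they are backed by numbers. A minimal sketch that pulls the Pearson correlation of each column with gross, on made-up values (the `toy` frame is hypothetical, so the printed numbers say nothing about the real dataset):

```python
import pandas as pd

# Hypothetical toy values
toy = pd.DataFrame({
    'budget': [1, 2, 3, 4, 5],
    'gross':  [2, 4, 6, 8, 10],
    'votes':  [5, 1, 4, 2, 3],
})

# Pearson correlation of every other column with gross
corr_with_gross = toy.corr()['gross'].drop('gross')
print(corr_with_gross.round(2))
```

Pearson correlation only measures linear association, so a curved (for example exponential) trend can score lower than it looks in the plot.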

**Actors and directors**

Let's use some wordclouds to see what happens in the star and director columns!

In [None]:
plt.subplots(figsize=(12,8))
wordcloud = WordCloud(
                          background_color='Black',
                          width=1920,
                          height=1080
                         ).generate(" ".join(dataset.star))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

In [None]:
plt.subplots(figsize=(12,8))
wordcloud = WordCloud(
                          background_color='White',
                          width=1920,
                          height=1080
                         ).generate(" ".join(dataset.director))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

**Let's see how runtime and score are distributed**

*Runtime*

In [None]:
x1 = dataset['runtime'].fillna(0.0).astype(float)
fig = ff.create_distplot([x1], ['Runtime'], bin_size=0.7, curve_type='normal', colors=["#6ad49b"])
fig.update_layout(title_text='Runtime with normal distribution')
fig.show()

*Score*

In [None]:
x2 = dataset['score'].fillna(0.0).astype(float)
fig = ff.create_distplot([x2], ['Score'], bin_size=0.1, curve_type='normal', colors=["#6ad49b"])
fig.update_layout(title_text='Score with normal distribution')
fig.show()

**Conclusion:**

   - Runtime roughly follows a normal distribution centered near 100 min, with a slight skew to the left.
   - Score values roughly follow a normal distribution centered around 6.4.
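
The direction and size of the skew can be checked numerically with `Series.skew()`. A minimal sketch on made-up samples (both series are hypothetical); on the real data the calls would be `dataset['runtime'].skew()` and `dataset['score'].skew()`.

```python
import pandas as pd

# Hypothetical toy samples
symmetric = pd.Series([1, 2, 3, 4, 5])
long_right_tail = pd.Series([1, 1, 2, 2, 3, 10])

# skew() is about 0 for symmetric data, positive when the right tail
# is longer and negative when the left tail is longer
symmetric_skew = symmetric.skew()
right_tail_skew = long_right_tail.skew()

print(symmetric_skew)
print(right_tail_skew)
```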

Thanks for reaching the end! Upvote if you liked it!