![example](images/director_shot.jpeg)

# Project Title

**Authors:** Maree Marinelis
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions you plan to answer in order to solve it.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [83]:
# Here you run your code to explore the data
# (pandas is already imported as pd in the first cell)


df1 = pd.read_csv('zippedData/bom.movie_gross.csv.gz')

#print (df1.head())
#print (df1.tail())
#print (df1.shape)
#df1.duplicated().value_counts()
#df1.sort_values
#df1.iloc[10]
#df1.info()

#df1.dtypes

print('df1',df1.columns)

df2 = pd.read_csv('zippedData/tmdb.movies.csv.gz')

print('df2',df2.columns)

df3 = pd.read_csv('zippedData/rt.movie_info.tsv.gz', sep='\t', skiprows=[6])  # Skip line 7 (index 6)

print('df3', df3.columns)

#df1_2 = pd.merge(df1, df2, on='title')

#print('df1_2',df1_2.columns)

#df1_3 = pd.merge(df1, df3, on='studio')

df4 = pd.read_csv('zippedData/rt.reviews.tsv.gz', sep='\t', skiprows=[4], encoding='latin-1')  # Use 'latin-1' encoding

print('df4', df4.columns)

#df1_4 = pd.merge(df1_3, df4, on='rating')


df6 = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')

print('df6', df6.columns)


print(df1['domestic_gross'].dtype)
print(df6['domestic_gross'].dtype)



# Next: link domestic and foreign sales with rating/vote average and genre



df1 Index(['title', 'studio', 'domestic_gross', 'foreign_gross', 'year'], dtype='object')
df2 Index(['Unnamed: 0', 'genre_ids', 'id', 'original_language', 'original_title',
       'popularity', 'release_date', 'title', 'vote_average', 'vote_count'],
      dtype='object')
df3 Index(['id', 'synopsis', 'rating', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime',
       'studio'],
      dtype='object')
df4 Index(['id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher',
       'date'],
      dtype='object')
df6 Index(['id', 'release_date', 'movie', 'production_budget', 'domestic_gross',
       'worldwide_gross'],
      dtype='object')
float64
object
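The last two lines of output show that `domestic_gross` is `float64` in `df1` but `object` in `df6`, so the two columns cannot be compared numerically until the `df6` values are converted. The sketch below shows one way to do that on a small made-up frame. It assumes the raw values are strings formatted like `$760,507,625`; the sample rows and the helper name `money_to_float` are hypothetical and not part of the datasets.

```python
import pandas as pd

# Hypothetical sample rows; the "$1,234,567" formatting is an assumption
# about how the budget file stores money, not a confirmed fact.
budgets = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C"],
    "domestic_gross": ["$760,507,625", "$0", "$42,762,350"],
})


def money_to_float(series: pd.Series) -> pd.Series:
    """Drop every character except digits and '.', then cast to float."""
    return series.str.replace(r"[^\d.]", "", regex=True).astype(float)


budgets["domestic_gross"] = money_to_float(budgets["domestic_gross"])
print(budgets["domestic_gross"].dtype)
```

Using a raw string (`r"..."`) for the pattern keeps `\d` from being read as a string escape.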


## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [84]:
# Here you run your code to clean the data


#df6['domestic_gross']
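# Strip currency symbols and thousands separators, then convert to float
# (raw string so that \d is not treated as a string escape)
df6['domestic_gross'] = df6['domestic_gross'].str.replace(r'[^\d.]', '', regex=True).astype(float)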

merged_df16 = pd.merge(df1, df6, left_on='title', right_on='movie')
#print(merged_df16)

#print(merged_df16[['domestic_gross_x','domestic_gross_y']])

# Sort on domestic_gross_y in descending order to see which films were the most successful.
merged_df16.sort_values(by='domestic_gross_y', ascending=False)

# The top 5 films are Black Panther, Avengers: Infinity War, Jurassic World, Incredibles 2, Rogue One: A Star Wars Story.
# Looking at rows 598 and 602: the domestic grosses from the two sources do not match. Possible reasons:
# - Data collection period: the data might have been collected at different times, so one source may include more recent figures.
# - Data sources: the data might come from different databases or platforms with different reporting standards or accuracy.
# - Data processing: gross revenues might have been calculated, adjusted, or reported differently in the original datasets.
# - Missing or incorrect entries: one or both datasets might contain inaccuracies, missing entries, or errors.


# Next: look at the release date and year columns to see which years had the most releases. Maybe also look at the most highly rated movies.
# Next: look at the correlation between production budget and financial success, domestically and worldwide.




# Parse 'release_date' in df6 (tn_movie_budgets) and extract the release year
df6['release_date'] = pd.to_datetime(df6['release_date'], errors='coerce')
df6['release_year'] = df6['release_date'].dt.year


# Counting the number of movies released each year in both datasets (df1 and df6)
# rename_axis + reset_index(name=...) gives the same column names on pandas 1.x and 2.x
movie_count_df1 = df1['year'].value_counts().rename_axis('year').reset_index(name='count_df1')
movie_count_df6 = df6['release_year'].value_counts().rename_axis('year').reset_index(name='count_df6')

# Merging the count data and summing up for total releases per year
# Note: a film that appears in both datasets is counted twice in 'total_releases'
movie_count_merged = pd.merge(movie_count_df1, movie_count_df6, how='outer', on='year').fillna(0)
movie_count_merged['total_releases'] = movie_count_merged['count_df1'] + movie_count_merged['count_df6']

# Sorting by 'total_releases' to find the years with most releases
top_years_releases_merged = movie_count_merged.sort_values(by='total_releases', ascending=False).head()

top_years_releases_merged


# Next: look at highly rated films. Show the highest-rated ones and compare against their vote counts, using the mean vote count as a reference.




# print(merged_df16[['foreign_gross','worldwide_gross']])



# ## Splitting the 'genre' column in df3 and exploring the number of movies per genre
# # Expanding the split genres into separate rows
# exploded_genres_df3 = df3.assign(genre=df3['genre'].str.split('|')).explode('genre')

# # Counting the number of movies per genre
# movies_per_genre_df3 = exploded_genres_df3['genre'].value_counts().reset_index().rename(columns={'index': 'genre', 'genre': 'movie_count'})

# # Displaying the top genres by movie count
# movies_per_genre_df3.head()

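|   | year | count_df1 | count_df6 | total_releases |
|---|------|-----------|-----------|----------------|
| 0 | 2015 | 450.0 | 338 | 788.0 |
| 1 | 2016 | 436.0 | 219 | 655.0 |
| 3 | 2011 | 399.0 | 254 | 653.0 |
| 4 | 2014 | 395.0 | 255 | 650.0 |
| 2 | 2012 | 400.0 | 235 | 635.0 |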
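
The comments in the cleaning cell note that `domestic_gross_x` and `domestic_gross_y` do not always agree after the merge. One way to size that disagreement is to compute the gap per film and flag rows above a tolerance. The sketch below uses made-up numbers in place of `merged_df16`; the 5% tolerance and the column names `gross_diff` and `gross_diff_pct` are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up values standing in for merged_df16; not real box-office figures.
merged = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "domestic_gross_x": [100.0, 200.0, 50.0],
    "domestic_gross_y": [100.0, 204.0, 80.0],
})

# Absolute and relative gap between the two sources
merged["gross_diff"] = merged["domestic_gross_y"] - merged["domestic_gross_x"]
merged["gross_diff_pct"] = merged["gross_diff"] / merged["domestic_gross_x"] * 100

# Keep rows where the sources disagree by more than the tolerance
TOLERANCE_PCT = 5.0  # arbitrary threshold, chosen only for this sketch
mismatched = merged[merged["gross_diff_pct"].abs() > TOLERANCE_PCT]
print(mismatched[["title", "gross_diff", "gross_diff_pct"]])
```

On the real data, rows where `domestic_gross_x` is zero or missing would need handling before the percentage is meaningful.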


## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***
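
The cleaning cell lists the correlation between production budget and gross revenue as a next step. A possible first pass, sketched here on toy numbers rather than the real `df6`, is a Pearson correlation matrix over the three money columns. The column names match those printed for `df6`; the values are invented, and the sketch assumes the columns have already been converted to floats.

```python
import pandas as pd

# Toy numbers for illustration only; they are not taken from the datasets.
films = pd.DataFrame({
    "production_budget": [10.0, 20.0, 30.0, 40.0, 50.0],
    "domestic_gross": [12.0, 25.0, 31.0, 48.0, 55.0],
    "worldwide_gross": [30.0, 45.0, 80.0, 95.0, 140.0],
})

# Pairwise Pearson correlation between all numeric columns
corr = films.corr(method="pearson")

# Correlation of each column with the production budget
print(corr["production_budget"])
```

A correlation coefficient only measures linear association, so a scatter plot of budget against gross is a useful companion check before drawing conclusions.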

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***