### Data Story with MovieLens 100K

#### Author: Aritra Chattaraj

#### Mentor: Sumit Dutta

#### Capstone Project One

#### Summary

This mini project documents my visualizations and exploratory findings on the MovieLens 100K dataset and builds hypotheses that can be tested later. These hypotheses will help shape the data into a consumable format for the end product, a recommendation system.

The first few parts of this document overlap with the data wrangling mini project. Beyond that, we explore the data visually to find insights and correlations that can be used to refine our hypotheses.

Finally, we end with a data story that explains the potential within the data that remains to be explored and investigated.

#### Table of Contents

1. [Introduction](#intro)

2. [Data Reading & Viewing](#datareading)

3. [Data Cleaning](#datacleaning)

4. [Missing Values & Outliers](#missingvalues)

5. [Exploratory Data Analysis](#eda)

6. [Insights & Correlations](#insights)

7. [Hypotheses](#hypotheses)

8. [Data Story](#datastory)

9. [Conclusion](#conclusion)

#### 1. Introduction <a id = 'intro'></a>

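MovieLens 100K is a movie-rating dataset released by the GroupLens research group at the University of Minnesota. It contains 100,000 ratings on a 1 to 5 scale from 943 users on 1,682 movies, along with basic demographic information about each user (age, sex, occupation, zip code) and genre flags for each movie.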

#### 2. Data Reading & Viewing <a id = 'datareading'></a>

In [2]:
# Importing relevant libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Loading the three data parts

# Reading user data

u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users_df = pd.read_csv('../data/u.user', sep = '|', names = u_cols,
                       encoding = 'latin1')

# Reading ratings data

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_df = pd.read_csv('../data/u.data', sep = '\t', names = r_cols,
                        encoding = 'latin1')

# Reading movies data

m_cols = ['movie_id', 'movie_title', 'release_date',
          'video_release_date', 'imdb_url',
          'unknown', 'action', 'adventure',
          'animation', 'childrens', 'comedy',
          'crime', 'documentary', 'drama',
          'fantasy', 'film-noir', 'horror',
          'musical', 'mystery', 'romance',
          'sci-fi', 'thriller', 'war', 'western']
movies_df = pd.read_csv('../data/u.item', sep = '|', names = m_cols,
                        encoding = 'latin1')
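The loading cell relies on two `read_csv` options: `sep` for the delimiter and `names` for the column labels, since the files are read without a header row. The following is a minimal sketch of that pattern on an in-memory string rather than the real file. The two sample lines, the `sample_users_df` name, and the `dtype` choice for `zip_code` are assumptions made for illustration, not part of the original notebook.

```python
import io

import pandas as pd

# Two hypothetical lines laid out like u.user: pipe-delimited, no header row.
sample_user_text = "1|24|M|technician|85711\n2|53|F|other|94043\n"

u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

# names= supplies the column labels because the text has no header line.
# dtype={'zip_code': str} is an assumption: it keeps zip codes as text so
# that leading zeros would not be dropped by integer parsing.
sample_users_df = pd.read_csv(io.StringIO(sample_user_text), sep='|',
                              names=u_cols, dtype={'zip_code': str})

print(sample_users_df.dtypes)
```

Reading `zip_code` as text is a defensive choice; whether it matters for the full file depends on the values it contains.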

In [4]:
users_df.head()

| | user_id | age | sex | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |


In [5]:
ratings_df.head()

| | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
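The `unix_timestamp` column stores seconds since the Unix epoch, which is hard to read directly. As a hedged sketch, it could be converted to a datetime column with `pd.to_datetime(..., unit='s')`. The rows below are copied from the `ratings_df.head()` output above; the `sample_ratings_df` and `rated_at` names are hypothetical.

```python
import pandas as pd

# First three rows from the ratings_df.head() output above
sample_ratings_df = pd.DataFrame({
    'user_id': [196, 186, 22],
    'movie_id': [242, 302, 377],
    'rating': [3, 3, 1],
    'unix_timestamp': [881250949, 891717742, 878887116],
})

# unit='s' tells pandas the integers are seconds since 1970-01-01 (UTC).
# 'rated_at' is a hypothetical column name used only in this sketch.
sample_ratings_df['rated_at'] = pd.to_datetime(
    sample_ratings_df['unix_timestamp'], unit='s')

print(sample_ratings_df[['unix_timestamp', 'rated_at']])
```

A datetime column makes it possible to group ratings by month or year later in the exploratory analysis.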


In [6]:
movies_df.head()

| | movie_id | movie_title | release_date | video_release_date | imdb_url | unknown | action | adventure | animation | childrens | ... | fantasy | film-noir | horror | musical | mystery | romance | sci-fi | thriller | war | western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
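The `video_release_date` column is empty in every row shown above, and `release_date` is stored as text. The sketch below illustrates one way to quantify missing values per column and to parse the date strings. It uses a small hand-built frame based on the rows above rather than the full file; `sample_movies_df` and `missing_share` are hypothetical names, and the `%d-%b-%Y` format assumes English month abbreviations.

```python
import numpy as np
import pandas as pd

# Hand-built subset of the movies_df.head() output above
sample_movies_df = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'movie_title': ['Toy Story (1995)', 'GoldenEye (1995)',
                    'Four Rooms (1995)'],
    'release_date': ['01-Jan-1995', '01-Jan-1995', '01-Jan-1995'],
    'video_release_date': [np.nan, np.nan, np.nan],
})

# Fraction of missing values in each column (1.0 means entirely empty)
missing_share = sample_movies_df.isna().mean()
print(missing_share)

# Parse strings such as '01-Jan-1995' into datetimes
sample_movies_df['release_date'] = pd.to_datetime(
    sample_movies_df['release_date'], format='%d-%b-%Y')
print(sample_movies_df['release_date'].dt.year)
```

A column that turns out to be entirely missing in the full data is a natural candidate for dropping before the join step.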


#### 3. Data Cleaning <a id = 'datacleaning'></a>

In [7]:
# First join: attach each rating to its movie's metadata via movie_id

new_df_one = pd.merge(movies_df, ratings_df, on = 'movie_id')

In [8]:
# Final join: attach user demographics to each rating via user_id

final_train_df = pd.merge(new_df_one, users_df, on = 'user_id')

In [9]:
final_train_df.head()

| | movie_id | movie_title | release_date | video_release_date | imdb_url | unknown | action | adventure | animation | childrens | ... | thriller | war | western | user_id | rating | unix_timestamp | age | sex | occupation | zip_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 308 | 4 | 887736532 | 60 | M | retired | 95076 |
| 1 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 308 | 5 | 887737890 | 60 | M | retired | 95076 |
| 2 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 308 | 4 | 887739608 | 60 | M | retired | 95076 |
| 3 | 7 | Twelve Monkeys (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Twelve%20Monk... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 308 | 4 | 887738847 | 60 | M | retired | 95076 |
| 4 | 8 | Babe (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Babe%20(1995) | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 308 | 5 | 887736696 | 60 | M | retired | 95076 |
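Because each rating should match exactly one movie and one user, the two joins above should leave one row per rating. A minimal sketch of how that could be checked uses the `validate` argument of `pd.merge`, which raises an error when the key relationship is not as declared. The tiny frames below are assumptions for illustration: the user 308 rows echo the output above, while the rating by user 1 is a made-up value.

```python
import pandas as pd

# Toy stand-ins for movies_df, ratings_df and users_df
movies = pd.DataFrame({
    'movie_id': [1, 4],
    'movie_title': ['Toy Story (1995)', 'Get Shorty (1995)'],
})
ratings = pd.DataFrame({
    'user_id': [308, 308, 1],
    'movie_id': [1, 4, 1],
    'rating': [4, 5, 5],  # the last rating is hypothetical
})
users = pd.DataFrame({'user_id': [308, 1], 'age': [60, 24]})

# Each movie_id appears once in movies and possibly many times in ratings
step_one = pd.merge(movies, ratings, on='movie_id', validate='one_to_many')

# Each user_id appears many times on the left and once in users
merged = pd.merge(step_one, users, on='user_id', validate='many_to_one')

# With complete keys, an inner join keeps exactly one row per rating
print(len(merged) == len(ratings))
```

If a key were duplicated in `movies` or `users`, `validate` would raise a `MergeError` instead of silently multiplying rows.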
