# Example for data science
The purpose of this exercise is to download a dataset, save it to a database, and answer a few questions about it.
<br />
Dataset source <link>https://files.grouplens.org/datasets/movielens/ml-latest-small.zip</link>

### Data preparation 
1. Download dataset
2. Unzip dataset
3. Load the data
4. Save it to database

### Questions to be answered
1. How many movies are in the data set?
2. What is the most common genre of movie?
3. What are the 10 highest-rated movies?
4. Which 5 users rate most often?
5. When were the first and last ratings in the data set made, and what were the titles of the rated movies?
6. Find all movies released in 1990

### Downloading dataset

In [1]:
import os
import requests

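# Create the target directory if it does not exist yet
os.makedirs('data', exist_ok=True)

url = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
# Certificate verification stays enabled (the default); verify=False would skip it
r = requests.get(url, allow_redirects=True, timeout=60)
r.raise_for_status()  # stop here on an HTTP error instead of saving an error page
with open('data/ml-latest-small.zip', 'wb') as f:
    bytes_written = f.write(r.content)
bytes_written  # number of bytes written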



978202

### Unzip dataset

In [2]:
import zipfile

with zipfile.ZipFile('data/ml-latest-small.zip', 'r') as zip_ref:
    zip_ref.extractall('data/unzipped')
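
Before extracting, it can be useful to check which files the archive actually contains. The following is a minimal sketch using only the standard library. `list_csv_members` is a hypothetical helper, and the archive is a tiny in-memory stand-in for `data/ml-latest-small.zip` so that the example runs without the download.

```python
import io
import zipfile


def list_csv_members(archive):
    """Return the names of all .csv files inside a zip archive (path or file object)."""
    with zipfile.ZipFile(archive, "r") as zf:
        return sorted(name for name in zf.namelist() if name.endswith(".csv"))


# Build a small in-memory archive as a stand-in for data/ml-latest-small.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("ml-latest-small/movies.csv", "movieId,title,genres\n")
    zf.writestr("ml-latest-small/ratings.csv", "userId,movieId,rating,timestamp\n")
    zf.writestr("ml-latest-small/README.txt", "stand-in readme\n")

buffer.seek(0)
csv_members = list_csv_members(buffer)
print(csv_members)
```

With the real download, the same helper could be called with the path `'data/ml-latest-small.zip'` instead of the buffer.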

### Load the data
While loading the data, we drop every row that contains a NaN value so that the tables have no missing entries.

In [3]:
import pandas as pd
import numpy as np

links = pd.read_csv("data/unzipped/ml-latest-small/links.csv").dropna()
movies = pd.read_csv("data/unzipped/ml-latest-small/movies.csv").dropna()
ratings = pd.read_csv("data/unzipped/ml-latest-small/ratings.csv").dropna()
tags = pd.read_csv("data/unzipped/ml-latest-small/tags.csv").dropna()
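
Calling `dropna()` on a whole table removes a row as soon as any single column is missing, so it is worth counting the missing values first. In the `info()` output further down, `links` ends up with 9734 rows while `movies` has 9742. The sketch below shows one way to audit this, assuming a made-up three-row stand-in for `links.csv`; the variable names are illustrative only.

```python
import io

import pandas as pd

# Tiny stand-in for links.csv; the second row has no tmdbId.
csv_text = "movieId,imdbId,tmdbId\n1,114709,862\n2,113497,\n3,113228,15602\n"
links_raw = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column before dropping anything.
missing_per_column = links_raw.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
links_clean = links_raw.dropna()
rows_dropped = len(links_raw) - len(links_clean)
print(f"Dropped {rows_dropped} of {len(links_raw)} rows")
```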

In [4]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [5]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


### Save it to database

In [8]:
from sqlalchemy import create_engine
import pymysql
from sqlalchemy.types import Integer, Text, String, DateTime, Float

userName = "root"
password = "password"
ip = "mysql"
port = "3306"

engine = create_engine(f'mysql+pymysql://{userName}:{password}@{ip}:{port}')


#### Let's create the database if it does not exist

In [9]:
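from sqlalchemy import text
dbName = "exercise"
# Engine.execute() no longer exists in SQLAlchemy 2.0, so run the statement on a connection
with engine.begin() as conn:
    conn.execute(text(f"CREATE DATABASE IF NOT EXISTS {dbName};"))
engine = create_engine(f'mysql+pymysql://{userName}:{password}@{ip}:{port}/{dbName}')  # engine recreated for simplicity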

In [10]:
movies.to_sql(
    'movies',
    engine,
    if_exists='replace',
    index=False,
    chunksize=500,
    dtype={
        "movieId": Integer,
        "title": Text,
        "genres": Text
    }
)

In [11]:
links.to_sql(
    'links',
    engine,
    if_exists='replace',
    index=False,
    chunksize=500,
    dtype={
        "movieId": Integer,
        "imdbId": Integer,
        "tmdbId": Float
    }
)

In [12]:
ratings.to_sql(
    'ratings',
    engine,
    if_exists='replace',
    index=False,
    chunksize=500,
    dtype={
        "userId": Integer,
        "movieId": Integer,
        "rating": Float,
        "timestamp": Integer
    }
)

In [13]:
tags.to_sql(
    'tags',
    engine,
    if_exists='replace',
    index=False,
    chunksize=500,
    dtype={
        "userId": Integer,
        "movieId": Integer,
        "tag": Text,
        "timestamp": Integer
    }
)
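
The `to_sql` pattern used above can be tried without a MySQL server. The sketch below is a stand-in, not the notebook's setup: it assumes an in-memory SQLite database from the standard library and a two-row copy of the movies table, then writes the table and reads it back to confirm the round trip.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")

movies_demo = pd.DataFrame(
    {
        "movieId": [1, 2],
        "title": ["Toy Story (1995)", "Jumanji (1995)"],
        "genres": [
            "Adventure|Animation|Children|Comedy|Fantasy",
            "Adventure|Children|Fantasy",
        ],
    }
)

# Same call pattern as in the notebook: replace the table, do not store the index.
movies_demo.to_sql("movies", conn, if_exists="replace", index=False)

# Read the table back and check that nothing was lost on the way.
movies_back = pd.read_sql_query("SELECT * FROM movies ORDER BY movieId", conn)
row_count = conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
print(row_count)
conn.close()
```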

### Load from database
For the sake of the exercise, we load the data back from the database.

In [14]:
movies = pd.read_sql_table(
    'movies',
    con=engine
)
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [15]:
links = pd.read_sql_table(
    'links',
    con=engine
)
links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9734 entries, 0 to 9733
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9734 non-null   int64  
 1   imdbId   9734 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.3 KB


In [16]:
ratings = pd.read_sql_table(
    'ratings',
    con=engine
)
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [17]:
tags = pd.read_sql_table(
    'tags',
    con=engine
)
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


## Questions and Answers

### Question 1) How many movies are in the data set?

In [18]:
numberOfDistinctMovieTitles = len(movies['title'].dropna().unique())
print(f'Number of distinct movie titles is {numberOfDistinctMovieTitles}')

Number of distinct movie titles is 9737
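
The number of distinct titles (9737) is lower than the 9742 rows reported by `movies.info()`, so some titles occur on more than one row. Which number counts as "the number of movies" depends on whether a movie is identified by its `movieId` or by its title. The sketch below compares the counts on a made-up four-row table; all values in it are illustrative.

```python
import pandas as pd

# Made-up stand-in for the movies table: one title appears under two ids.
movies_demo = pd.DataFrame(
    {
        "movieId": [1, 2, 3, 4],
        "title": ["Movie A (2000)", "Movie B (2001)", "Movie A (2000)", "Movie C (2002)"],
    }
)

row_count = len(movies_demo)
distinct_ids = movies_demo["movieId"].nunique()
distinct_titles = movies_demo["title"].nunique()

# Titles that occur on more than one row.
repeated_titles = sorted(
    movies_demo.loc[movies_demo["title"].duplicated(keep=False), "title"].unique()
)
print(row_count, distinct_ids, distinct_titles, repeated_titles)
```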


### Question 2) What is the most common genre of movie?

In [19]:
genres = movies['genres'].str.split(pat="|")
counter = {}

for genreList in genres:
    for genre in genreList:
        if genre not in counter:
            counter[genre] = 1
        else:
            counter[genre] += 1
counter

{'Adventure': 1263,
 'Animation': 611,
 'Children': 664,
 'Comedy': 3756,
 'Fantasy': 779,
 'Romance': 1596,
 'Drama': 4361,
 'Action': 1828,
 'Crime': 1199,
 'Thriller': 1894,
 'Horror': 978,
 'Mystery': 573,
 'Sci-Fi': 980,
 'War': 382,
 'Musical': 334,
 'Documentary': 440,
 'IMAX': 158,
 'Western': 167,
 'Film-Noir': 87,
 '(no genres listed)': 34}

In [20]:
import operator

mostCommonGenre = max(counter.items(), key=operator.itemgetter(1))[0]

print(f'Most common genre is {mostCommonGenre}')

Most common genre is Drama
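
The counting loop above can also be written with `collections.Counter` from the standard library. This is a sketch of an alternative, not the notebook's method; the input is the `genres` column of the five sample rows shown earlier, so its result differs from the full-dataset answer.

```python
from collections import Counter

# Stand-in values for movies['genres'], taken from the first rows shown earlier.
genre_strings = [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama|Romance",
    "Comedy",
]

# Count every genre label across all pipe-separated strings.
genre_counts = Counter(
    genre for genres in genre_strings for genre in genres.split("|")
)
most_common_genre, occurrences = genre_counts.most_common(1)[0]
print(most_common_genre, occurrences)
```

On these five rows the most common genre is Comedy; on the full table the loop above reports Drama.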


### Question 3) What are the 10 highest-rated movies?

In [21]:
# Mean rating per title; select the rating column before averaging
mean_ratings = movies.set_index('movieId').join(ratings.set_index('movieId')).groupby('title')['rating'].mean()
mean_ratings.sort_values(ascending=False).head(10)

title
Gena the Crocodile (1969)                    5.0
True Stories (1986)                          5.0
Cosmic Scrat-tastrophe (2015)                5.0
Love and Pigeons (1985)                      5.0
Red Sorghum (Hong gao liang) (1987)          5.0
Thin Line Between Love and Hate, A (1996)    5.0
Lesson Faust (1994)                          5.0
Eva (2011)                                   5.0
Who Killed Chea Vichea? (2010)               5.0
Siam Sunset (1999)                           5.0
Name: rating, dtype: float64
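
Every title in the list above has a mean of exactly 5.0. A plausible reason (an assumption, not checked here) is that each of them has only very few ratings, since a single 5-star rating is enough to reach the top. One common refinement is to require a minimum number of ratings before ranking. The sketch below illustrates this on made-up data; `min_ratings` is a hypothetical threshold.

```python
import pandas as pd

# Made-up ratings: film A has one perfect rating, film B has several good ones.
ratings_demo = pd.DataFrame(
    {
        "title": ["A", "B", "B", "B", "C", "C"],
        "rating": [5.0, 4.5, 4.0, 5.0, 3.0, 2.0],
    }
)

min_ratings = 2  # hypothetical threshold, to be tuned for the real data

# Mean rating and number of ratings per title.
stats = ratings_demo.groupby("title")["rating"].agg(["mean", "count"])

# Keep only titles with enough ratings, then rank by mean.
top = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)
print(top)
```

Film A drops out despite its perfect mean because it has a single rating.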

### Question 4) Which 5 users rate most often?
"Most often" is hard to answer precisely, but "most ratings" is simple, so we rank users by their total number of ratings.

In [22]:
ratings['userId'].value_counts().head(5)

414    2698
599    2478
474    2108
448    1864
274    1346
Name: userId, dtype: int64
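
One possible reading of "most often" is the number of ratings per day of activity rather than the total count. The sketch below works under that assumption with made-up `(userId, timestamp)` pairs; the definition of an active span used here is only one of several reasonable choices.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400

# Made-up (userId, unix timestamp) pairs.
events = [
    (1, 0), (1, 86_400), (1, 172_800),   # 3 ratings spread over 3 days
    (2, 0), (2, 10), (2, 20), (2, 30),   # 4 ratings within one day
]

timestamps_by_user = defaultdict(list)
for user_id, ts in events:
    timestamps_by_user[user_id].append(ts)

ratings_per_day = {}
for user_id, stamps in timestamps_by_user.items():
    # Active span in days, counting the first day as a full day.
    active_days = (max(stamps) - min(stamps)) // SECONDS_PER_DAY + 1
    ratings_per_day[user_id] = len(stamps) / active_days

# Users ordered from highest to lowest rating frequency.
ranking = sorted(ratings_per_day, key=ratings_per_day.get, reverse=True)
print(ranking, ratings_per_day)
```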

### Question 5) When were the first and last ratings in the data set made, and what were the titles of the rated movies?

In [23]:
first = ratings[ratings.timestamp == ratings.timestamp.min()].head(1)
last = ratings[ratings.timestamp == ratings.timestamp.max()].head(1)

In [24]:
first

Unnamed: 0,userId,movieId,rating,timestamp
66662,429,22,4.0,828124615


In [25]:
last

Unnamed: 0,userId,movieId,rating,timestamp
81092,514,162,4.0,1537799250


In [26]:
firstMovieRated = movies[movies.movieId == first.movieId.values[0]]
firstMovieRated

Unnamed: 0,movieId,title,genres
21,22,Copycat (1995),Crime|Drama|Horror|Mystery|Thriller


In [27]:
lastMovieRated = movies[movies.movieId == last.movieId.values[0]]
lastMovieRated

Unnamed: 0,movieId,title,genres
135,162,Crumb (1994),Documentary


In [28]:
print(f'First movie rated {firstMovieRated["title"].values[0]} {first.timestamp.values[0]}')
print(f'Last movie rated {lastMovieRated["title"].values[0]} {last.timestamp.values[0]}')

First movie rated Copycat (1995) 828124615
Last movie rated Crumb (1994) 1537799250
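
The printed values are Unix timestamps. MovieLens timestamps are seconds since midnight UTC on 1 January 1970, so they can be turned into readable dates with the standard library. A minimal sketch using the two values printed above:

```python
from datetime import datetime, timezone

# The two timestamps printed above, in seconds since 1970-01-01 UTC.
first_ts = 828124615
last_ts = 1537799250

first_dt = datetime.fromtimestamp(first_ts, tz=timezone.utc)
last_dt = datetime.fromtimestamp(last_ts, tz=timezone.utc)

print(first_dt.isoformat())
print(last_dt.isoformat())
```

In UTC, the first rating falls on 29 March 1996 and the last on 24 September 2018.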


### Question 6) Find all movies released in 1990
The only place where we can find a movie's release year is the title column of the movies table.
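
Slicing a fixed number of characters from the end of the title works only if every title ends exactly with `(YYYY)`. A regular expression is somewhat more tolerant. The sketch below assumes the year is a four-digit number in parentheses at the end of the title; `release_year` is a hypothetical helper, and two of the sample titles are made up to show the edge cases.

```python
import re

# A four-digit year in parentheses at the very end of the title,
# tolerating trailing whitespace.
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")


def release_year(title):
    """Return the release year parsed from a title, or None if there is none."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else None


titles = [
    "Home Alone (1990)",
    "Toy Story (1995)",
    "Ghost (1990) ",     # made-up variant with a trailing space
    "Untitled Project",  # made-up title without a year
]
from_1990 = [t for t in titles if release_year(t) == 1990]
print(from_1990)
```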

In [29]:
moviesFrom1990 = []
for title in movies["title"].values:
    year = title[-5:-1]
    if year == '1990':
        moviesFrom1990.append(title)
moviesFrom1990

['Home Alone (1990)',
 'Ghost (1990)',
 'Dances with Wolves (1990)',
 'Pretty Woman (1990)',
 'Days of Thunder (1990)',
 'Grifters, The (1990)',
 'Tie Me Up! Tie Me Down! (¡Átame!) (1990)',
 'Paris Is Burning (1990)',
 'Goodfellas (1990)',
 'Trust (1990)',
 'Rosencrantz and Guildenstern Are Dead (1990)',
 "Miller's Crossing (1990)",
 'Femme Nikita, La (Nikita) (1990)',
 'Pump Up the Volume (1990)',
 'Cyrano de Bergerac (1990)',
 'Amityville Curse, The (1990)',
 'Die Hard 2 (1990)',
 'Young Guns II (1990)',
 'Marked for Death (1990)',
 'Hunt for Red October, The (1990)',
 'King of New York (1990)',
 'Metropolitan (1990)',
 "Child's Play 2 (1990)",
 'Exorcist III, The (1990)',
 'Gremlins 2: The New Batch (1990)',
 'Back to the Future Part III (1990)',
 'Godfather: Part III, The (1990)',
 'Rescuers Down Under, The (1990)',
 'NeverEnding Story II: The Next Chapter, The (1990)',
 'My Blue Heaven (1990)',
 'Sheltering Sky, The (1990)',
 'Edward Scissorhands (1990)',
 'Tales from the Darkside