## The Data Preparation

The data comes from the [MovieTweetings dataset](https://github.com/sidooms/MovieTweetings/tree/master/recsyschallenge2014). To learn more about the project and how the dataset was built, see the [publication here](http://crowdrec2013.noahlab.com.hk/papers/crowdrec2013_Dooms.pdf).


In [1]:
import numpy as np
import pandas as pd

# Read in the MovieTweetings dataset originally taken from https://github.com/sidooms/MovieTweetings/tree/master/latest
movies = pd.read_csv('movies.dat', delimiter='::', header=None, names=['movie_id', 'movie', 'genre'], dtype={'movie_id': object}, engine='python')
reviews = pd.read_csv('ratings.dat', delimiter='::', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'], dtype={'movie_id': object, 'user_id': object, 'timestamp': object}, engine='python')
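
The two-character `::` separator is the reason for `engine='python'`: pandas treats separators longer than one character as regular expressions, which the default C parser does not support. A minimal sketch of the same call on an in-memory sample (the two rows are hypothetical lines written in the layout of `movies.dat`):

```python
import io
import pandas as pd

# Hypothetical rows in the same "id::title::genres" layout as movies.dat
raw = io.StringIO(
    "8::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short\n"
    "91::Le manoir du diable (1896)::Short|Horror\n"
)

# Multi-character separators require the Python parsing engine
sample = pd.read_csv(raw, sep='::', header=None,
                     names=['movie_id', 'movie', 'genre'],
                     dtype={'movie_id': object}, engine='python')

print(sample)
```

Reading `movie_id` as `object` keeps the ids as strings, so they are not converted to integers on load.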


## For each of the datasets, there are a couple of cleaning steps we need to take care of:

### Movies
* Pull the release year from the title into a new column
* Create a 0/1 dummy column for each century of release (1800's, 1900's, and 2000's)
* Create a 0/1 dummy column for each genre
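
The three movie steps can also be sketched with vectorized pandas string methods. This is an illustrative alternative on a small hypothetical sample, not the function used in this notebook; the titles and genres below are made up for the example.

```python
import pandas as pd

# Hypothetical sample in the same shape as the movies dataframe
sample = pd.DataFrame({
    'movie': ['Le manoir du diable (1896)', 'Heat (1995)', 'Inception (2010)'],
    'genre': ['Short|Horror', 'Action|Crime|Drama', 'Action|Sci-Fi'],
})

# Pull the four-digit year that closes the title, e.g. "(1995)"
sample['year'] = sample['movie'].str.extract(r'\((\d{4})\)$', expand=False)

# One 0/1 column per century, keyed on the first two digits of the year
for prefix in ['18', '19', '20']:
    sample[prefix + "00's"] = (sample['year'].str[:2] == prefix).astype(int)

# One 0/1 column per genre, splitting on the '|' separator
genre_dummies = sample['genre'].str.get_dummies(sep='|')
sample = pd.concat([sample, genre_dummies], axis=1)

print(sample[['year', "1800's", "1900's", "2000's", 'Action', 'Horror']])
```

One difference to keep in mind: `str.get_dummies` gives all zeros for a row with a missing genre, whereas a row-by-row function may leave such rows as `NaN`.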

### Reviews
* Create a date from the Unix timestamp
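
As a rough sketch of the conversion, `pd.to_datetime` with `unit='s'` interprets a value as seconds since the Unix epoch and returns the time in UTC. The single row below is a hypothetical sample using a timestamp in the same format as `ratings.dat`. A conversion based on `datetime.datetime.fromtimestamp` would instead return the local time of the machine running it, so the two can differ by the local UTC offset.

```python
import pandas as pd

# Hypothetical sample: timestamps are stored as strings of epoch seconds
sample = pd.DataFrame({'timestamp': ['1381006850']})

# Cast to integer, then interpret as seconds since the epoch (UTC)
sample['date'] = pd.to_datetime(sample['timestamp'].astype('int64'), unit='s')

print(sample)
```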


In [2]:
def prepare_data(movies, reviews):
    '''
    Prepare and clean the movies and reviews dataframes:
    - Pull the release year from the title into a new column
    - Create a 0/1 dummy column for each century of release (1800's, 1900's, and 2000's)
    - Create a 0/1 dummy column for each genre
    - Create a date from the timestamp in reviews
    Note: both dataframes are modified in place.
    @param: movies and reviews dataframes, not yet cleaned
    @return: cleaned movies and reviews dataframes
    '''
    # Pull the four-digit year from the end of the title, e.g. "(1894)";
    # 'year' and 'date' hold the same value
    movies['year'] = movies.movie.apply(lambda x: x[-5:-1]
                                        if x[-1] == ')' else np.nan)
    movies['date'] = movies.movie.apply(lambda x: x[-5:-1]
                                        if x[-1] == ')' else np.nan)

    
    def dates(x):
        '''
        Return 1 if the year string x starts with the century prefix yr
        (set by the loop below), else 0. Missing years (NaN) return 0.
        '''
        if isinstance(x, str) and x[:2] == yr:
            return 1
        else:
            return 0

    # Apply the dummy function once for each of the three century prefixes
    for yr in ['18', '19', '20']:
        movies[yr + "00's"] = movies.year.apply(dates)

        
    # Empty list to be filled with genres
    genres = []

    # Extend the genres list with the '|'-separated values from the genre column;
    # rows with a missing genre (NaN) are skipped
    for i in movies.genre:
        try:
            genres.extend(i.split('|'))
        except AttributeError:
            pass

    # Remove duplicates from the list
    genres = set(genres)

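    def gens(x):
        '''
        Return 1 if the genre i (set by the loop below) is one of the
        '|'-separated genres in x, else 0. Rows with a missing genre
        return NaN, which is why the genre columns are floats.
        '''
        try:
            # Match whole genre names so that, e.g., 'Music' does not match 'Musical'
            if i in x.split('|'):
                return 1
            else:
                return 0
        except AttributeError:
            return np.nan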
    
    # loop for mapping the gens function over the genres collected
    for i in genres:
        movies[i] = movies.genre.apply(gens)
        
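    import datetime
    # Create a datetime from the Unix timestamp (seconds since the epoch).
    # Note: fromtimestamp() returns local time, so the result depends on the machine's time zone.
    reviews['date'] = reviews.timestamp.apply(lambda x: datetime.datetime.fromtimestamp(int(x)))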
    
    return movies, reviews

In [3]:
movies, reviews = prepare_data(movies, reviews)

In [4]:
reviews.head()

|   | user_id | movie_id | rating | timestamp | date |
|---|---|---|---|---|---|
| 0 | 1 | 114508 | 8 | 1381006850 | 2013-10-05 22:00:50 |
| 1 | 2 | 208092 | 5 | 1586466072 | 2020-04-09 22:01:12 |
| 2 | 2 | 358273 | 9 | 1579057827 | 2020-01-15 03:10:27 |
| 3 | 2 | 10039344 | 5 | 1578603053 | 2020-01-09 20:50:53 |
| 4 | 2 | 6751668 | 9 | 1578955697 | 2020-01-13 22:48:17 |


In [5]:
movies.head()

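|   | movie_id | movie | genre | year | date | 1800's | 1900's | 2000's | Music | Sci-Fi | ... | News | Adventure | Horror | Thriller | Mystery | Western | Fantasy | Drama | Romance | Short |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | Edison Kinetoscopic Record of a Sneeze (1894) | Documentary\|Short | 1894 | 1894 | 1 | 0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 10 | La sortie des usines Lumière (1895) | Documentary\|Short | 1895 | 1895 | 1 | 0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 12 | The Arrival of a Train (1896) | Documentary\|Short | 1896 | 1896 | 1 | 0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 25 | The Oxford and Cambridge University Boat Race ... | NaN | 1895 | 1895 | 1 | 0 | 0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 91 | Le manoir du diable (1896) | Short\|Horror | 1896 | 1896 | 1 | 0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |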


In [6]:
reviews.to_csv('./reviews_clean.csv')
movies.to_csv('./movies_clean.csv')
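
Because `to_csv` writes the row index by default, reading these files back with a plain `pd.read_csv` produces an extra `Unnamed: 0` column, and id columns are inferred as integers again. A minimal sketch of the round trip on a small hypothetical dataframe, using an in-memory buffer in place of a file:

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned reviews dataframe
df = pd.DataFrame({'user_id': ['1', '2'], 'rating': [8, 5]})

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as an unnamed first column

# A plain read turns the saved index into an 'Unnamed: 0' column
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 restores the index; dtype keeps the ids as strings
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, dtype={'user_id': str})

print(naive.columns.tolist())
print(restored.columns.tolist())
```

Whether to restore the index on load or drop the extra column afterwards is a matter of preference; both leave the same data.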