## Data preparation, basic statistics and visualization

Data mining, assignment, `28.3.2022.`  
**`Izabela Lesac`**

An inevitable part of every data mining project is finding,
cleaning and preparing data. In this task, you will get acquainted with a dataset, use procedures for converting the data into an appropriate format, and
review and display basic statistics.

### Data

In the task you will review and prepare Hollywood movie ratings from
the [MovieLens](https://grouplens.org/datasets/movielens/) collection from the period **1995-2016**.

The same data is used in all assignments, so you should get to know it well. It is a database for
evaluating recommender systems that contains viewers and their ratings on a scale of 1 to 5.
In addition to the basic user-rating matrix, it also includes movie information (e.g., genre, date, tags, cast).

The dataset is in folder `./podatki/ml-latest-small`. The database contains the following files:

* ratings.csv: user data and ratings,
* movies.csv: movie genre information,
* cast.csv: cast information,
* tags.csv: tag information (*tags*),
* links.csv: links to related databases.

Before starting to solve the task, take a good look at the data and read the **README.txt** file. You can learn about the details on the [website](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html).

Prepare methods for loading the data into appropriate data structures. They will
also come in handy for later tasks.
Pay attention to the size of the data.

Write the code that reads the files and prepares the matrices (and other structures) you will use to answer the questions below.

You can split the code into multiple cells.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import Orange

In [None]:
tableRatings = Orange.data.Table('podatki/ml-latest-small/ratings.csv')
tableMovies = Orange.data.Table('podatki/ml-latest-small/movies.csv')
tableCast = Orange.data.Table('podatki/ml-latest-small/cast.csv')
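
Since the ratings file is by far the largest, it can help to group the ratings by movie once instead of filtering the whole table for every movie. The following is only a minimal sketch using the built-in `csv` module; the function names `load_ratings` and `load_movies` are hypothetical, and the in-memory sample rows are made up so that the cell runs without the data files.

```python
import csv
import io
from collections import defaultdict


def load_ratings(handle):
    """Group (userId, rating, timestamp) tuples by movieId."""
    ratings_by_movie = defaultdict(list)
    for row in csv.DictReader(handle):
        ratings_by_movie[int(row["movieId"])].append(
            (int(row["userId"]), float(row["rating"]), int(row["timestamp"]))
        )
    return ratings_by_movie


def load_movies(handle):
    """Map movieId to (title, list of genres)."""
    movies = {}
    for row in csv.DictReader(handle):
        movies[int(row["movieId"])] = (row["title"], row["genres"].split("|"))
    return movies


# Made-up rows standing in for ratings.csv and movies.csv.
sample_ratings = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,1217897793\n"
    "2,10,3.0,1217897900\n"
    "1,20,5.0,1217898000\n"
)
sample_movies = io.StringIO(
    "movieId,title,genres\n"
    "10,Example A (1999),Comedy|Drama\n"
    "20,Example B (2001),Horror\n"
)

ratings_by_movie = load_ratings(sample_ratings)
movies = load_movies(sample_movies)
print(len(ratings_by_movie[10]), movies[20])
```

With the real data, the same functions could be called on a file opened with `open('podatki/ml-latest-small/ratings.csv', 'rt', encoding='utf-8', newline='')`.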

### Questions

The main purpose of data mining is *knowledge discovery from data*,
i.e., answering questions using computational approaches.

Using the principles you have learned in the exercises and lectures, answer
the questions below. For each question, think carefully about how
best to give, show or justify the answer. The essential part is the answers to
the questions rather than the implementation of your solution.

#### Question 1 (15%):
Which movies are the best on average? Prepare a list of
movies and their average ratings and print 10 movies from the top of the list.
Do you see any problems with such an assessment? How could you solve them? What are the
results of that?

You can split the code into multiple cells.

In [None]:
avgRatings = []
for movie in tableMovies:
    
    ratingMovieId = Orange.data.filter.SameValue(tableRatings.domain['movieId'],movie['movieId'])
    movieRating = ratingMovieId(tableRatings)
    movieTitle = str(movie['title'])
      
    ratingsN = 0
    sumR = 0
    
    for rating in movieRating:
        sumR += float(rating[2])
        ratingsN += 1
        
    if ratingsN!=0:
        avgRating = sumR/ratingsN  # keep the average as a number, not a string
    else:
        avgRating = 0
    avgRatings.append([movieTitle,ratingsN,avgRating])

In [None]:
avgRatingsSorted = sorted(avgRatings, key=lambda movie: float(movie[2]), reverse=True)
print(avgRatingsSorted[:10])

Answer: This assessment gives us the movies with the highest average rating. However, it does not account for the number of ratings each movie has, so it is not very reliable: a movie can reach the top of the list after being rated highly only once. We could fix this by requiring each movie to have at least x ratings to be considered. This requirement ensures our results are not biased towards movies rated by only a few users.

In [None]:
avgRatingsNew = []
for movie in avgRatingsSorted:
    if movie[1] > 100:
        avgRatingsNew.append(movie)
print(avgRatingsNew[:10])

Answer: Now we only consider movies which have more than 100 ratings. This gives us a more reliable view of the top-rated movies.
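
The averaging and the minimum-count requirement can also be done in a single pass over the ratings, which avoids filtering the table once per movie. This is a sketch under assumptions: the helper `average_ratings` and its `min_count` parameter are hypothetical names, the sample pairs are made up, and the threshold here is inclusive (`>=`) rather than strict.

```python
from collections import defaultdict


def average_ratings(ratings, min_count=1):
    """ratings: iterable of (movie_id, rating) pairs.

    Returns (movie_id, number_of_ratings, mean_rating) tuples for movies
    with at least `min_count` ratings, sorted by mean rating, best first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for movie_id, rating in ratings:
        totals[movie_id] += rating
        counts[movie_id] += 1
    result = [
        (movie_id, counts[movie_id], totals[movie_id] / counts[movie_id])
        for movie_id in counts
        if counts[movie_id] >= min_count
    ]
    result.sort(key=lambda item: item[2], reverse=True)
    return result


# Made-up ratings: movie 1 has a single, very high rating.
sample = [(1, 5.0), (2, 4.0), (2, 4.5), (2, 3.5), (3, 2.0), (3, 3.0)]
top_all = average_ratings(sample)
top_filtered = average_ratings(sample, min_count=2)
print(top_all[0], top_filtered[0])
```

In the sample, the movie with one rating leads the unfiltered list and disappears once at least two ratings are required, which is the same effect the threshold has on the real data.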

#### Question 2 (15%):
Each film belongs to one or more genres.
How many genres are there? Show the distribution of genres using appropriate
visualization.

You can split the code into multiple cells.

In [None]:
genreList = []
for movie in tableMovies:
    movieGenres = str(movie["genres"]).split("|")
    for genre in movieGenres:
        if (genre not in genreList):
            genreList.append(genre)
print(genreList)
len(genreList)

Answer: There are 20 genres. Note that some movies have no genre listed, and this entry also counts as a genre in this dataset.

In [None]:
from csv import DictReader

genresdict = {}
readerMovies = DictReader(open('podatki/ml-latest-small/movies.csv','rt', encoding='utf-8'))
for row in readerMovies:
    movieM = row["movieId"]
    titleM = row["title"]
    genresM = row["genres"]
    
    single = genresM.split("|")
    for genre in single:
        if genre not in genresdict:
            genresdict.update({genre:1})
        else:
            genresdict[genre] += 1

plt.bar(genresdict.keys(), genresdict.values(), align='center', width=0.5,color='magenta')
plt.xticks(rotation = 90)
plt.xlabel('genres')
plt.ylabel('movies')
plt.show()
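
The same counting can be written more compactly with `collections.Counter`, whose `most_common()` method also returns the genres ordered by frequency, which tends to make the bar chart easier to read. This is only a sketch: `count_genres` is a hypothetical helper, the sample strings are made up, and the placeholder label `(no genres listed)` is an assumption about how the dataset marks movies without a genre.

```python
from collections import Counter


def count_genres(genre_strings):
    """Count how many movies carry each genre in pipe-separated strings."""
    counts = Counter()
    for genres in genre_strings:
        counts.update(genres.split("|"))
    return counts


# Made-up values of the "genres" column.
sample = [
    "Comedy|Drama",
    "Drama",
    "Horror|Thriller",
    "(no genres listed)",
    "Drama|Thriller",
]
genre_counts = count_genres(sample)
ordered = genre_counts.most_common()  # list of (genre, count), most frequent first
print(ordered)
```

The two columns of `ordered` can be passed to `plt.bar` in place of the dictionary keys and values.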

#### Question 3 (20%):
The number of ratings is different for each film. But is there a relationship between the number of ratings and the average movie rating? Describe the procedure that you used to answer the question.

You can split the code into multiple cells.

In [None]:
avgRatingsList = []
numberRaitings = []

for movie in tableMovies:
    ratingMovieId = Orange.data.filter.SameValue(tableRatings.domain['movieId'],movie['movieId'])
    movieRating = ratingMovieId(tableRatings)
    movieTitle = str(movie['title'])
      
    ratingsN = 0
    sumR = 0
    
    for rating in movieRating:
        sumR += float(rating[2])
        ratingsN += 1
        
    if ratingsN!=0:
        avgRating = sumR/ratingsN
    else:
        avgRating = 0
    
    avgRatingsList.append(avgRating)
    numberRaitings.append(ratingsN)

In [None]:
plt.scatter(avgRatingsList, numberRaitings, color='magenta')
plt.xlabel('average rating of a movie')
plt.ylabel('# of ratings')
plt.show()

Answer: Yes, the scatter plot suggests a relationship between the number of ratings and the average rating of a movie: films with higher average ratings tend to have more ratings. One possible explanation is that people tend to watch movies that have good ratings and avoid movies with low ratings.
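
A scatter plot shows the relationship only visually; a correlation coefficient can put a number on it. Below is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson` and the sample values are made up for illustration and are not results from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up, perfectly linear sample: the coefficient should be close to 1.
counts = [1, 2, 3, 4, 5]
means = [2.0, 4.0, 6.0, 8.0, 10.0]
r = pearson(counts, means)
print(round(r, 3))
```

When applying this to the real lists, movies without any ratings should probably be left out first, because their average of 0 is a placeholder and not an actual rating.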

#### Question 4 (30%):
Each rating was entered on a specific date (column
*timestamp*). Does the popularity of individual films change over time?
Solve the problem by ordering the ratings for a given film by time and, at each time point, calculating the average of the last 30, 50, or 100 ratings. Draw a graph of how the rating changes and show it for two interesting example movies.

You can split the code into multiple cells.

In [None]:
import datetime

ratings1 = []
ratings2 = []

movie1 = 3262
movie2 = 4437

ratingMovieId1 = Orange.data.filter.SameValue(tableRatings.domain['movieId'],movie1)
movie1 = ratingMovieId1(tableRatings)
for rating in movie1:
    date = datetime.datetime.utcfromtimestamp(rating[3])
    ratings1.append([rating[2],date])

ratingMovieId2  = Orange.data.filter.SameValue(tableRatings.domain['movieId'],movie2)
movie2 = ratingMovieId2(tableRatings)
for rating in movie2:
    date = datetime.datetime.utcfromtimestamp(rating[3])
    ratings2.append([rating[2],date])
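
The average over the last 30, 50, or 100 ratings asked for in this question can be computed with a sliding window once the ratings are sorted by time. This is a sketch under assumptions: `rolling_mean` is a hypothetical helper, it expects `(timestamp, rating)` pairs (the reverse of the order used in `ratings1` and `ratings2`), and the sample values are made up.

```python
from collections import deque


def rolling_mean(ratings, window=30):
    """ratings: iterable of (timestamp, rating) pairs in any order.

    Returns (timestamp, mean of the last `window` ratings) for every
    time point at which a full window of ratings is available.
    """
    ordered = sorted(ratings)  # chronological order
    recent = deque()           # ratings currently inside the window
    running_sum = 0.0
    result = []
    for timestamp, rating in ordered:
        recent.append(rating)
        running_sum += rating
        if len(recent) > window:
            running_sum -= recent.popleft()  # drop the oldest rating
        if len(recent) == window:
            result.append((timestamp, running_sum / window))
    return result


# Made-up ratings, deliberately out of chronological order.
sample = [(3, 5.0), (1, 1.0), (2, 3.0), (4, 3.0)]
trend = rolling_mean(sample, window=2)
print(trend)
```

Plotting the timestamps against the means with `plt.plot` for each of the two movies gives the requested graph; a movie needs at least `window` ratings to produce any points.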

Answer: **write down the answer and explain it**

#### Question 5 (20%):
How would you rate the popularity of individual actors? Describe the procedure
you used for the evaluation and print the 10 most popular actors.

You can split the code into multiple cells.

In [None]:
avgRatings = []
for movie in tableMovies:
    
    ratingMovieId = Orange.data.filter.SameValue(tableRatings.domain['movieId'],movie['movieId'])
    movieRating = ratingMovieId(tableRatings)
    movieTitle = str(movie['title'])
      
    ratingsN = 0
    sumR = 0
    
    for rating in movieRating:
        sumR += float(rating[2])
        ratingsN += 1
        
    if ratingsN!=0:
        avgRating = str(sumR/ratingsN)
    else:
        avgRating = 0
    avgRatings.append([movieTitle,ratingsN,avgRating])
    
avgRatingsNew = []
for movie in avgRatingsSorted:
    if movie[1] > 100:
        avgRatingsNew.append(movie)

In [None]:
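actors = {}
for castRow in tableCast:
    # Split the pipe-separated cast string into individual actor names
    actorList = str(castRow['cast']).split('|')
    moviesActors =  Orange.data.filter.SameValue(tableMovies.domain['movieId'], castRow["movieId"])
    moviesCast = moviesActors(tableMovies)
    # Count the number of movies each actor appears in
    for actor in actorList:
        actors[actor] = actors.get(actor, 0) + 1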

Answer: One way would be to get a list of actors and the number of movies they acted in: the more movies, the more popular the actor. However, this is flawed: it does not tell us how popular an actor is, just that they act a lot. We should also take into account the ratings of their movies.
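
One possible way to take the movie ratings into account is to average the ratings of all movies an actor appears in, weighted by how many ratings each movie received, and to ignore actors with very few movies. This is only one of several reasonable definitions of popularity, not an established method. The helper `actor_scores`, its `min_movies` parameter, and the sample data are hypothetical.

```python
from collections import defaultdict


def actor_scores(cast_by_movie, movie_stats, min_movies=2):
    """cast_by_movie: {movie_id: "Actor A|Actor B"}
    movie_stats:   {movie_id: (number_of_ratings, mean_rating)}

    Returns (actor, number_of_movies, weighted_mean_rating) tuples for
    actors with at least `min_movies` rated movies, best first.
    """
    rating_sum = defaultdict(float)
    rating_count = defaultdict(int)
    movie_count = defaultdict(int)
    for movie_id, cast in cast_by_movie.items():
        n_ratings, mean_rating = movie_stats.get(movie_id, (0, 0.0))
        if n_ratings == 0:
            continue  # unrated movies carry no information
        for actor in cast.split("|"):
            rating_sum[actor] += n_ratings * mean_rating
            rating_count[actor] += n_ratings
            movie_count[actor] += 1
    scores = [
        (actor, movie_count[actor], rating_sum[actor] / rating_count[actor])
        for actor in movie_count
        if movie_count[actor] >= min_movies
    ]
    scores.sort(key=lambda item: item[2], reverse=True)
    return scores


# Made-up cast lists and per-movie statistics.
cast_by_movie = {1: "Actor A|Actor B", 2: "Actor A", 3: "Actor B|Actor C"}
movie_stats = {1: (10, 4.0), 2: (30, 3.0), 3: (10, 5.0)}
ranking = actor_scores(cast_by_movie, movie_stats)
print(ranking[:10])
```

The `min_movies` requirement plays the same role as the minimum number of ratings in Question 1: it keeps an actor with a single well-rated movie from topping the list.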

#### Bonus question (5%):

What's your favorite movie? Why?

Answer: John Carpenter’s The Thing, because of its incredible practical effects.

### Notes
You can use the built-in `csv` module to load data.

In [None]:
from csv import DictReader

reader = DictReader(open('podatki/ml-latest-small/ratings.csv', 'rt', encoding='utf-8'))
for row in reader:
    user = row["userId"]
    movie = row["movieId"]
    rating = row["rating"]
    timestamp = row["timestamp"]

Data in the last line of the file:

In [None]:
user, movie, rating, timestamp

Convert the time format (*Unix time*). The format codes are listed in the documentation of the [`datetime`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) module.

In [None]:
from datetime import datetime

t = 1217897793 # Unix-time
ts = datetime.fromtimestamp(t).strftime('%Y-%m-%d %H:%M')
ts
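
Note that `datetime.fromtimestamp(t)` without a time zone converts to the local time of the machine, so the printed string can differ between computers. A sketch of a reproducible variant, assuming the timestamps are seconds since the Unix epoch in UTC, passes an explicit time zone:

```python
from datetime import datetime, timezone

t = 1217897793  # Unix time in seconds (assumed to be UTC-based)
utc_time = datetime.fromtimestamp(t, tz=timezone.utc)
utc_text = utc_time.strftime('%Y-%m-%d %H:%M')
print(utc_text)
```

The resulting `datetime` objects are time-zone aware and can be sorted and compared directly, which is convenient when ordering ratings by time.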