# Preparation

## Research Question (subject to change)
What are the different career paths in computer science?\
What proportion of software developers hold a four-year degree? A master's degree? Attended a bootcamp?

Alternatively, explore the MovieLens movie ratings dataset from https://grouplens.org/datasets/movielens/


## Initial Data Exploration

In [1]:
import pandas as pd
import plotly.express as px

In [2]:
ratings = pd.read_csv("ml-25m/ratings.csv")
movies = pd.read_csv("ml-25m/movies.csv")

We will start simple by looking at the most highly rated movies.

In [8]:
# Average rating and number of ratings per movie (only the "rating" column is aggregated)
movie_ratings = (
    ratings.groupby("movieId")["rating"]
    .agg(rating="mean", count="count")
    .reset_index()
)

movie_ratings = pd.merge(movie_ratings, movies, on="movieId")

movie_ratings.sort_values(by=["rating", "count"], ascending=False, inplace=True)
movie_ratings.head(10)

Unnamed: 0,movieId,rating,count,title,genres
0,118268,5.0,3,Borrowed Time (2012),Drama
1,148298,5.0,3,Awaken (2013),Drama|Romance|Sci-Fi
2,165787,5.0,3,Lonesome Dove Church (2014),Western
3,179731,5.0,3,Sound of Christmas (2016),Drama
4,133297,5.0,2,Genius on Hold (2013),(no genres listed)
5,137853,5.0,2,El camino (2008),Drama
6,139547,5.0,2,Placebo: Soulmates Never Die: Live in Paris 20...,(no genres listed)
7,140369,5.0,2,War Arrow (1954),Adventure|Drama|Romance|War|Western
8,140377,5.0,2,About Sarah,Drama
9,143422,5.0,2,2 (2007),Drama


As the table above shows, average rating alone is a misleading metric. It would imply that Borrowed Time (2012), rated by only 3 people in this dataset, is the greatest movie of all time.

However, with some extra research, we found that it scores 5.9/10 on IMDb from 371 ratings, and 3.2/5 on Letterboxd from 59 ratings.
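
One common way to temper averages built on very few ratings is to shrink each movie's mean toward the overall mean, weighted by how many ratings it has. The sketch below is only an illustration on made-up numbers, not part of the analysis in this notebook. The `toy` frame, the `shrunk_rating` helper, and the `prior_weight` value of 50 are all assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in for movie_ratings; these values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "title": ["Obscure film", "Popular film", "Mid film"],
    "rating": [5.0, 4.4, 3.0],
    "count": [3, 80000, 60000],
})


def shrunk_rating(df, prior_weight=50):
    """Pull each mean rating toward the count-weighted global mean.

    prior_weight (a hypothetical tuning parameter) behaves like a number of
    extra ratings, all equal to the global mean, added to every movie.
    """
    global_mean = (df["rating"] * df["count"]).sum() / df["count"].sum()
    return (df["count"] * df["rating"] + prior_weight * global_mean) / (
        df["count"] + prior_weight
    )


toy["shrunk"] = shrunk_rating(toy)
print(toy.sort_values("shrunk", ascending=False))
```

With these toy numbers the film with 3 ratings no longer ranks first, while films with tens of thousands of ratings barely move. The choice of `prior_weight` is a judgment call and would need to be tuned against the real data.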

In [9]:
movie_ratings.sort_values(by=["count", "rating"], ascending=False, inplace=True)
movie_ratings.head(10)

Unnamed: 0,movieId,rating,count,title,genres
2566,356,4.048011,81491,Forrest Gump (1994),Comedy|Drama|Romance|War
1532,318,4.413576,81482,"Shawshank Redemption, The (1994)",Crime|Drama
2038,296,4.188912,79672,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
2209,593,4.151342,74127,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller
2201,2571,4.154099,72674,"Matrix, The (1999)",Action|Sci-Fi|Thriller
2336,260,4.120189,68717,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi
10105,480,3.679175,64144,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller
1988,527,4.247579,60411,Schindler's List (1993),Drama|War
2742,110,4.002273,59184,Braveheart (1995),Action|Drama|War
1994,2959,4.228311,58773,Fight Club (1999),Action|Crime|Drama|Thriller


Ranking films primarily by popularity (how many ratings each has received) seems to give a more reliable result.

Let's now examine the relationship between popularity and rating.

In [11]:
px.scatter(movie_ratings, x="count", y="rating", opacity=.5)

We now have our first question at hand: how many ratings does a movie need before its average rating reliably reflects its quality, rather than the personal biases of a few raters (for example, friends of the director)?

In [19]:
movie_ratings_filtered = movie_ratings[movie_ratings["count"] >= 100]
movie_ratings_filtered.sort_values("rating", ascending=False).head(10)

Unnamed: 0,movieId,rating,count,title,genres
1524,171011,4.483096,1124,Planet Earth II (2016),Documentary
1525,159817,4.464797,1747,Planet Earth (2006),Documentary
1532,318,4.413576,81482,"Shawshank Redemption, The (1994)",Crime|Drama
1534,170705,4.398599,1356,Band of Brothers (2001),Action|Drama|War
1632,171495,4.326715,277,Cosmos,(no genres listed)
1633,858,4.324336,52498,"Godfather, The (1972)",Crime|Drama
1644,179135,4.289833,659,Blue Planet II (2017),Documentary
1648,50,4.284353,55366,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
1651,198185,4.267361,288,Twin Peaks (1989),Drama|Mystery
1652,1221,4.261759,34188,"Godfather: Part II, The (1974)",Crime|Drama
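
The cutoff of 100 ratings used above can be given a rough statistical motivation. If ratings for a movie were independent draws with standard deviation `std`, the 95% margin of error of the mean would be about `1.96 * std / sqrt(n)`. The sketch below inverts that formula. The helper name `ratings_needed` and the assumed standard deviation of 1.0 star are illustrative choices, not values measured from this dataset.

```python
import math


def ratings_needed(std, margin, z=1.96):
    """Smallest n for which z * std / sqrt(n) <= margin.

    std:    assumed standard deviation of one movie's ratings (in stars)
    margin: desired half-width of the confidence interval (in stars)
    z:      normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    return math.ceil((z * std / margin) ** 2)


# Assumption: ratings for a single movie spread by about 1 star.
for margin in (0.5, 0.2, 0.1):
    print(f"margin ±{margin}: at least {ratings_needed(std=1.0, margin=margin)} ratings")
```

Under these assumptions, a threshold near 100 corresponds to a margin of about ±0.2 stars. This only addresses sampling noise. It does not correct for systematically biased raters, and a movie with a handful of identical ratings can still look deceptively certain.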


Interestingly, many of the less popular titles that still rate highly are documentaries.
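
One way to check this impression is to compare the share of documentaries among the top-rated titles with their share overall. The sketch below uses a made-up `toy` frame in the same shape as `movie_ratings_filtered`. The titles, ratings, and `top_n` value are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for movie_ratings_filtered; values are illustrative.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "rating": [4.5, 4.4, 4.3, 3.5, 3.2, 3.0],
    "genres": [
        "Documentary", "Documentary", "Crime|Drama",
        "Comedy", "Action", "Documentary|War",
    ],
})

top_n = 3  # hypothetical cutoff for "top-rated"
top = toy.nlargest(top_n, "rating")

# regex=False treats the genre name as a literal substring.
top_share = top["genres"].str.contains("Documentary", regex=False).mean()
overall_share = toy["genres"].str.contains("Documentary", regex=False).mean()

print(f"Documentary share in top {top_n}: {top_share:.2f}")
print(f"Documentary share overall: {overall_share:.2f}")
```

If the top-rated share is clearly larger than the overall share on the real data, that would support the observation above.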

Using the README.txt provided, we manually created a "genres.txt" file that contains a list of genres, where each genre is on a separate line.

In [26]:
with open("ml-25m/genres.txt") as file:
    genres = [line.rstrip("\n") for line in file]
print(genres)

['Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western', '(no genres listed)']
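
As a cross-check on the hand-typed list, the genre labels can also be derived from the data itself. This assumes the `genres` column is pipe-separated, as it appears in the outputs above. The `toy_movies` frame and its labels are stand-ins for illustration and should be replaced by the real `movies` frame.

```python
import pandas as pd

# Toy stand-in for movies.csv; labels here are illustrative.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Children|Comedy", "Drama", "(no genres listed)"],
})

# Split "A|B|C" into lists, put one genre per row, then collect the distinct labels.
genres_in_data = sorted(toy_movies["genres"].str.split("|").explode().unique())
print(genres_in_data)
```

Comparing such a list against `genres.txt` would reveal any label that is spelled differently in the file than in the README. A label that never occurs verbatim in the data matches no rows when filtering.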


In [36]:
genre_avgs = []
for genre in genres:
    # regex=False matches the genre name literally, so "(no genres listed)"
    # is not parsed as a regular-expression group
    in_genre = movie_ratings_filtered["genres"].str.contains(genre, regex=False)
    # Average only the rating column rather than reducing the whole DataFrame
    genre_avgs.append(movie_ratings_filtered.loc[in_genre, "rating"].mean())

genre_scores = pd.DataFrame({"genre": genres, "avg_rating": genre_avgs})
genre_scores


Unnamed: 0,genre,avg_rating
0,Action,3.15822
1,Adventure,3.220105
2,Animation,3.385723
3,Children's,
4,Comedy,3.19402
5,Crime,3.365292
6,Documentary,3.658601
7,Drama,3.465981
8,Fantasy,3.232622
9,Film-Noir,3.743389
