This project analyzes the MovieLens ml-latest-small dataset and scrapes additional movie data from IMDb.

The Ratings class analyzes ratings.csv:
* which ratings movies receive, who gives them, and when

The Tags class processes tags.csv:
* which tags users assign, and which are the most popular, the longest, and the most word-rich

The Movies class examines movies.csv:
* release years, genres, and titles

The Links class processes links.csv and scrapes IMDb:
* loads the website data and caches it
* builds its methods on the scraped data



In [84]:
from movielens_analysis import Ratings
from movielens_analysis import Tags
from movielens_analysis import Movies
from movielens_analysis import Links


print("Initializing Ratings system...")
%timeit r = Ratings('data/ml-latest-small/ratings.csv')
r = Ratings('data/ml-latest-small/ratings.csv')
print("System ready!")
print(f"Data loaded: {len(r.ratings_data)} ratings")
print('p.s. each method is limited to the first 1000 rows')

Initializing Ratings system...
15.9 ms ± 514 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
System ready!
Data loaded: 1000 ratings
p.s. each method is limited to the first 1000 rows


In [85]:
# Which year received the most ratings?
print("Analyzing movie distribution by year...")
%timeit years_data = r.movies.dist_by_year()
years_data = r.movies.dist_by_year()

top_year = max(years_data, key=years_data.get)
top_count = years_data[top_year]

print(f"Most productive film year - {top_year}!")
print(f"Users gave {top_count} ratings this year")

Analyzing movie distribution by year...
813 μs ± 17.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Most productive film year - 1996!
Users gave 358 ratings this year


In [86]:
# Which ratings are given most frequently?
print("Examining rating distribution...")
%timeit rating_dist = r.movies.dist_by_rating()
rating_dist = r.movies.dist_by_rating()

most_common_rating = max(rating_dist, key=rating_dist.get)
most_common_count = rating_dist[most_common_rating]

print(f"Most common rating - {most_common_rating} ⭐!")
print(f"Given {most_common_count} times, wow")
print("Full distribution:")
for rating, count in sorted(rating_dist.items()):
    print(f"  {rating} ⭐: {count} ratings")


Examining rating distribution...
287 μs ± 4.52 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Most common rating - 4.0 ⭐!
Given 292 times, wow
Full distribution:
  0.5 ⭐: 24 ratings
  1.0 ⭐: 39 ratings
  1.5 ⭐: 11 ratings
  2.0 ⭐: 57 ratings
  2.5 ⭐: 7 ratings
  3.0 ⭐: 253 ratings
  3.5 ⭐: 17 ratings
  4.0 ⭐: 292 ratings
  4.5 ⭐: 33 ratings
  5.0 ⭐: 267 ratings


In [87]:
print("Which movies are most discussed?")
%timeit top_by_count = r.movies.top_by_num_of_ratings(5)
top_by_count = r.movies.top_by_num_of_ratings(5)

print("TOP-5 movies by rating count:")
for i, (movie, count) in enumerate(top_by_count.items(), 1):
    print(f"{i}. {movie} - ratings count: {count} ")

Which movies are most discussed?
803 μs ± 21.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
TOP-5 movies by rating count:
1. Usual Suspects, The (1995) - ratings count: 4 
2. Pulp Fiction (1994) - ratings count: 4 
3. Fugitive, The (1993) - ratings count: 4 
4. Schindler's List (1993) - ratings count: 4 
5. Batman (1989) - ratings count: 4 


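The output of top_by_ratings depends on the chosen metric, 'mean' or 'median', so the two rankings may differ.
This is expected: the mean reflects every rating a movie received, while the median reflects only the middle one.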

In [88]:
print("Best movies (after Harry Potter)")
%timeit top_mean = r.movies.top_by_ratings(5, 'mean')
top_mean = r.movies.top_by_ratings(5, 'mean')

%timeit top_median = r.movies.top_by_ratings(5, 'median')
top_median = r.movies.top_by_ratings(5, 'median')

print("TOP-5 by average rating:")
for i, (movie, rating) in enumerate(top_mean.items(), 1):
    print(f"{i}. {movie}: {rating} ⭐")

print("TOP-5 by median rating:")
for i, (movie, rating) in enumerate(top_median.items(), 1):
    print(f"{i}. {movie}: {rating} ⭐")

Best movies (after Harry Potter)
1.97 ms ± 73.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 38.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
TOP-5 by average rating:
1. Bottle Rocket (1996): 5.0 ⭐
2. Canadian Bacon (1995): 5.0 ⭐
3. Star Wars: Episode IV - A New Hope (1977): 5.0 ⭐
4. James and the Giant Peach (1996): 5.0 ⭐
5. Wizard of Oz, The (1939): 5.0 ⭐
TOP-5 by median rating:
1. Bottle Rocket (1996): 5.0 ⭐
2. Canadian Bacon (1995): 5.0 ⭐
3. Star Wars: Episode IV - A New Hope (1977): 5.0 ⭐
4. Tommy Boy (1995): 5.0 ⭐
5. Forrest Gump (1994): 5.0 ⭐
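
To illustrate why the mean and median rankings can diverge, here is a minimal sketch. It assumes the ratings are available as (title, rating) pairs; the helper name top_by_metric and the sample data are hypothetical and are not taken from movielens_analysis.

```python
from collections import defaultdict
from statistics import mean, median

def top_by_metric(pairs, n, metric="mean"):
    """Rank titles by the mean or median of their ratings (hypothetical helper)."""
    by_title = defaultdict(list)
    for title, rating in pairs:
        by_title[title].append(rating)
    agg = mean if metric == "mean" else median
    scores = {title: round(agg(values), 2) for title, values in by_title.items()}
    # Highest score first; the sort is stable, so ties keep insertion order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])

# Made-up ratings, for illustration only.
sample = [("A", 5.0), ("A", 5.0), ("A", 1.0), ("B", 4.0), ("B", 4.0), ("B", 4.5)]
print(top_by_metric(sample, 2, "mean"))
print(top_by_metric(sample, 2, "median"))
```

In this made-up sample a single low rating drags the mean of "A" below "B", while its median stays at 5.0, so the two rankings come out in opposite order.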


In [89]:
# Which movies are watched most on weekends?

print("Weekend movies")
%timeit weekend_hits = r.movies.weekend_hits(5)
weekend_hits = r.movies.weekend_hits(5)


print("TOP-5 weekend movies (weekend rating ratio):")

for i, (movie, ratio) in enumerate(weekend_hits.items(), 1):
    print(f"{i}. {movie}: {ratio*100:.1f}% ratings on weekends")

Weekend movies
1.48 ms ± 33.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
TOP-5 weekend movies (weekend rating ratio):
1. Good Will Hunting (1997): 100.0% ratings on weekends
2. Tommy Boy (1995): 67.0% ratings on weekends
3. Gladiator (2000): 67.0% ratings on weekends
4. Grumpier Old Men (1995): 50.0% ratings on weekends
5. Heat (1995): 50.0% ratings on weekends
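
A weekend ratio like the one above could be computed roughly as follows. This is only a sketch: it assumes each rating carries a Unix timestamp interpreted in UTC, and the function name weekend_ratio and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

def weekend_ratio(rows):
    """Share of each movie's ratings that fall on Saturday or Sunday (UTC assumed)."""
    total = defaultdict(int)
    weekend = defaultdict(int)
    for title, timestamp in rows:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).weekday()
        total[title] += 1
        if day >= 5:  # 5 = Saturday, 6 = Sunday
            weekend[title] += 1
    return {title: round(weekend[title] / total[title], 2) for title in total}

# Made-up timestamps: 2000-01-01 was a Saturday, 2000-01-03 a Monday.
saturday = int(datetime(2000, 1, 1, 12, tzinfo=timezone.utc).timestamp())
monday = int(datetime(2000, 1, 3, 12, tzinfo=timezone.utc).timestamp())
rows = [("A", saturday), ("A", saturday), ("A", monday), ("B", monday)]
print(weekend_ratio(rows))
```

Rounding the ratio to two decimals before formatting it as a percentage would explain values such as 67.0% (rather than 66.7%) in the output above, though that is a guess about the implementation.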


In [90]:
print("Which movies are most controversial?")
%timeit controversial = r.movies.top_controversial(5)
controversial = r.movies.top_controversial(5)

print("TOP-5 most controversial movies (by rating variance):")
for i, (movie, variance) in enumerate(controversial.items(), 1):
    print(f"{i}. {movie}: variance {variance} 🆘")

Which movies are most controversial?
1.43 ms ± 37.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
TOP-5 most controversial movies (by rating variance):
1. Bambi (1942): variance 5.06 🆘
2. Rescuers, The (1977): variance 5.06 🆘
3. My Fair Lady (1964): variance 5.06 🆘
4. Matrix, The (1999): variance 4.0 🆘
5. Schindler's List (1993): variance 3.42 🆘
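
Values such as 5.06 and 4.0 are consistent with the population variance of just two ratings (for example 0.5 and 5.0, or 1.0 and 5.0). Whether top_controversial really uses population variance is an assumption; the sketch below only shows the arithmetic, and rating_variance is a hypothetical helper.

```python
from statistics import pvariance

def rating_variance(ratings):
    """Population variance of a list of ratings, rounded to two decimals."""
    return round(pvariance(ratings), 2)

print(rating_variance([0.5, 5.0]))       # two ratings at opposite ends of the scale
print(rating_variance([1.0, 5.0]))
print(rating_variance([4.0, 4.0, 4.0]))  # full agreement gives zero variance
```

With so few ratings per movie, one outlier is enough to put a title at the top of this list.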


In [91]:
# Who is the most active user?
print("Top movie critics ranking")
%timeit user_activity = r.users.dist_by_num_ratings()
user_activity = r.users.dist_by_num_ratings()

# Pick the user with the most ratings explicitly instead of relying on dict order
most_active_user = max(user_activity, key=user_activity.get)
most_active_count = user_activity[most_active_user]

print(f"Most active user (ID: {most_active_user}) gave {most_active_count} ratings!")
print(f"Total active users: {len(user_activity)}")

Top movie critics ranking
215 μs ± 6.53 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Most active user (ID: 6) gave 314 ratings!
Total active users: 7


Yes, the first 1000 rows contain ratings from only 7 users.

In [92]:
# Who are the most generous critics?
print("Checking users with highest average ratings...")
%timeit user_ratings = r.users.dist_by_avg_ratings()
user_ratings = r.users.dist_by_avg_ratings()

top_raters = dict(list(user_ratings.items())[:5])
print("TOP-5 most generous users:")
for i, (user_id, avg_rating) in enumerate(top_raters.items(), 1):
    print(f"{i}. User {user_id}: average rating {avg_rating} ⭐")

Checking users with highest average ratings...
242 μs ± 17.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
TOP-5 most generous users:
1. User 1: average rating 4.37 ⭐
2. User 2: average rating 3.95 ⭐
3. User 5: average rating 3.64 ⭐
4. User 4: average rating 3.56 ⭐
5. User 6: average rating 3.49 ⭐


In [93]:
# Who are our night monsters?
print("Finding users who rate movies at night...")
%timeit night_owls = r.users.night_monsters(5)
night_owls = r.users.night_monsters(5)

print("TOP-5 night monsters (ratings between 10 PM and 6 AM):")
for i, (user_id, ratio) in enumerate(night_owls.items(), 1):
    print(f"{i}. User {user_id}: {ratio*100:.1f}% night ratings")

Finding users who rate movies at night...
991 μs ± 36.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
TOP-5 night monsters (ratings between 10 PM and 6 AM):
1. User 2: 100.0% night ratings
2. User 3: 100.0% night ratings
3. User 4: 25.0% night ratings
4. User 7: 14.0% night ratings
5. User 1: 12.0% night ratings
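
The night window (10 PM to 6 AM) wraps around midnight, which is easy to get wrong. A minimal sketch of such a check is shown below; it assumes Unix timestamps interpreted in UTC, and the names is_night, night_ratio and at_hour are hypothetical.

```python
from datetime import datetime, timezone

def is_night(timestamp):
    """True if a Unix timestamp falls between 22:00 and 06:00 (UTC assumed)."""
    hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).hour
    # The window wraps around midnight, so it is a union of two ranges.
    return hour >= 22 or hour < 6

def night_ratio(timestamps):
    """Share of night-time ratings, rounded to two decimals."""
    nights = sum(is_night(ts) for ts in timestamps)
    return round(nights / len(timestamps), 2)

def at_hour(hour):
    """Made-up timestamp on 2000-01-03 at the given UTC hour."""
    return int(datetime(2000, 1, 3, hour, tzinfo=timezone.utc).timestamp())

sample = [at_hour(23), at_hour(3), at_hour(12), at_hour(21)]
print(night_ratio(sample))
```

Writing the condition as `22 <= hour < 6` would never be true, which is why the two ranges are combined with `or`.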


In [94]:
# Who is the most unpredictable?
print("Finding users with most unpredictable ratings...")
%timeit controversial_users = r.users.top_controversial_users(5)
controversial_users = r.users.top_controversial_users(5)

print("TOP-5 most unpredictable users (by rating variance):")
for i, (user_id, variance) in enumerate(controversial_users.items(), 1):
    print(f"{i}. User {user_id}: variance {variance}")

Finding users with most unpredictable ratings...
514 μs ± 27 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
TOP-5 most unpredictable users (by rating variance):
1. User 3: variance 4.26
2. User 4: variance 1.72
3. User 7: variance 1.65
4. User 5: variance 0.96
5. User 6: variance 0.72


Ratings analysis complete! We gained some interesting insights and are exhausted.
Enjoy watching! I recommend starting with Harry Potter :)

Moving on to Links.

In [95]:
print('Initializing Links...')

links = Links()

Initializing Links...


In [96]:
print('Fetching data from cache:')

%timeit -r1 -n1
get_data = links.get_imdb([1, 2, 3], ['title', 'director', 'runtime'])
print(get_data)

Fetching data from cache:
[[3, 'Старые ворчуны разбушевались', 'Howard Deutch', 101], [2, 'Джуманджи', 'Joe Johnston', 104], [1, 'История игрушек', 'John Lasseter', 81]]
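
The caching used by Links is not shown in this notebook. As a rough idea of how a file-backed cache around a scraper might work, here is a minimal sketch. The class JsonCache, the file name imdb_cache.json and the stand-in fake_fetch are all hypothetical; a real implementation would pass an actual scraping function instead.

```python
import json
import os
import tempfile

class JsonCache:
    """Tiny file-backed cache keyed by movie ID (hypothetical helper)."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.data = json.load(fh)

    def get(self, movie_id, fetch):
        key = str(movie_id)  # JSON object keys are always strings
        if key not in self.data:
            self.data[key] = fetch(movie_id)  # only call the scraper on a miss
            with open(self.path, "w", encoding="utf-8") as fh:
                json.dump(self.data, fh, ensure_ascii=False)
        return self.data[key]

calls = []

def fake_fetch(movie_id):
    """Stand-in for a real scraper; records how often it is called."""
    calls.append(movie_id)
    return {"title": f"Movie {movie_id}"}

cache_path = os.path.join(tempfile.mkdtemp(), "imdb_cache.json")
cache = JsonCache(cache_path)
cache.get(1, fake_fetch)                  # miss: calls fake_fetch
cache.get(1, fake_fetch)                  # hit: served from memory
JsonCache(cache_path).get(1, fake_fetch)  # hit: served from the file on disk
print(calls)
```

Passing ensure_ascii=False keeps non-ASCII titles readable in the cache file, provided the file is always opened as UTF-8.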


In [97]:
print('Method returns top directors by number of films:')

%timeit -r1 -n1
t_directors = links.top_directors(5)
for director in t_directors:
    print(director)

Method returns top directors by number of films:
Alfred Hitchcock
Woody Allen
Frank Capra
Stanley Kubrick
Steven Spielberg


In [98]:
print('Top most expensive movies:')

%timeit -r1 -n1
m_expensive = links.most_expensive(5)
for rich in m_expensive:
    print(f'{rich}💸')

Top most expensive movies:
Звёздные войны: Эпизод 4 - Новая надежда💸
E.T. the Extra-Terrestrial💸
Король лев💸
Парк юрского периода💸
Форрест Гамп💸


In [99]:
print('And these are the most profitable movies:')

%timeit -r1 -n1
m_profitable = links.most_profitable(5)
for profit in m_profitable:
    print(profit)

And these are the most profitable movies:
Парк юрского периода
Король лев
День независимости
E.T. the Extra-Terrestrial
Форрест Гамп


In [100]:
print('Longest movies?')
%timeit -r1 -n1
m_longest = links.longest(5)
print('TOP:')
for long in m_longest:
    print(long)

Longest movies?
TOP:
Андеграунд
Унесённые ветром
Танцующий с волками
Лоуренс Аравийский
Однажды в Америке


In [101]:
print('Most expensive movies PER MINUTE of screen time:')

%timeit -r1 -n1
t_per_min = links.top_cost_per_minute(5)
# Use a descriptive loop variable; `min` would shadow the built-in function
for title in t_per_min:
    print(title)

Most expensive movies PER MINUTE of screen time:
День независимости
Парк юрского периода
Горячие головы 2
Рыбка по имени Ванда
Узы братства


In [102]:
print("Let's check movies by specific director, e.g., John Lasseter:")
%timeit -r1 -n1
m_by_director = links.movies_by_director("John Lasseter")
print(m_by_director)

Let's check movies by specific director, e.g., John Lasseter:
['История игрушек']


In [103]:
print('Checking the last method of this class :)')

%timeit -r1 -n1
shortest = links.shortest_movie()
print('Shortest film...')
print(shortest)

Checking the last method of this class :)
Shortest film...
Любовь и 45 калибр


Moving on to the next class, Movies. It is small, with only 3 methods.

In [104]:
movies = Movies('data/ml-latest-small/movies.csv')

In [105]:
%timeit -r1 -n1
years = movies.dist_by_release()
print("Top-5 years by film count:")
for year, count in list(years.items())[:5]:
    print(f"  {year}: {count} films")

Top-5 years by film count:
  1995: 180 films
  1994: 141 films
  1996: 141 films
  1993: 83 films
  1992: 18 films


In [106]:
%timeit -r1 -n1
genres = movies.dist_by_genres()
print("Top-5 genres:")
for genre, count in list(genres.items())[:5]:
    print(f"  {genre}: {count} films")

Top-5 genres:
  Drama: 438 films
  Comedy: 311 films
  Romance: 207 films
  Thriller: 178 films
  Action: 121 films


In [107]:
%timeit -r1 -n1
multi_genres = movies.most_genres(5)
print("Top movies with most genres:")
for title, count in multi_genres.items():
    print(f"  {title}: {count} genres")

Top movies with most genres:
  Strange Days (1995): 6 genres
  "Lion King: 6 genres
  "Getaway: 6 genres
  Super Mario Bros. (1993): 6 genres
  Beauty and the Beast (1991): 6 genres
  All Dogs Go to Heaven 2 (1996): 6 genres
  Space Jam (1996): 6 genres
  Aladdin and the King of Thieves (1996): 6 genres
  Toy Story (1995): 5 genres
  Money Train (1995): 5 genres
  Copycat (1995): 5 genres
  "City of Lost Children: 5 genres
  Pocahontas (1995): 5 genres
  Bad Boys (1995): 5 genres
  "Kid in King Arthur's Court: 5 genres
  True Lies (1994): 5 genres
  RoboCop 3 (1993): 5 genres
  "Pagemaster: 5 genres
  Ghost (1990): 5 genres
  Aladdin (1992): 5 genres
  Snow White and the Seven Dwarfs (1937): 5 genres
  Heavy Metal (1981): 5 genres
  James and the Giant Peach (1996): 5 genres
  "Alphaville (Alphaville: 5 genres
  Oliver & Company (1988): 5 genres
  "Hunchback of Notre Dame: 5 genres
  North by Northwest (1959): 5 genres
  Charade (1963): 5 genres
  Beat the Devil (1953): 5 genres
  Cind
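
Several titles in the output above start with a stray double quote and are cut off at a comma (for example "Lion King). One plausible cause is splitting each CSV line on commas by hand, which breaks titles that are quoted because they contain a comma themselves. The sketch below contrasts that with the standard csv module; the sample rows are written in the style of movies.csv for illustration and are not read from the actual file.

```python
import csv
import io

# Rows in the style of movies.csv; the second title contains a comma,
# so the CSV format wraps it in double quotes.
raw = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    '364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX\n'
)

# Naive approach: split every data line on commas.
naive = [line.split(",") for line in raw.splitlines()[1:]]

# csv.DictReader understands quoted fields and keeps the title intact.
parsed = list(csv.DictReader(io.StringIO(raw)))

print(naive[1][1])
print(parsed[1]["title"])
print(len(parsed[1]["genres"].split("|")))
```

The naive split yields the same kind of truncated title seen in the output, while csv.DictReader returns the full title and the correct genre field.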

Finally, the last class: Tags.

In [108]:
tags = Tags('data/ml-latest-small/tags.csv')

In [109]:
print('Top-5 tags with most words')
%timeit tags.most_words(5)
most_wordy = tags.most_words(5)
print('Examining tags with most words helps understand how detailed user descriptions are.')
for tag, count in most_wordy.items():
    print(f"'{tag}' - {count} words")

Top-5 tags with most words
831 μs ± 41.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Examining tags with most words helps understand how detailed user descriptions are.
'Something for everyone in this one... saw it without and plan on seeing it with kids!' - 16 words
'the catholic church is the most corrupt organization in history' - 10 words
'Oscar (Best Music - Original Score)' - 6 words
'Everything you want is here' - 5 words
'based on a true story' - 5 words


In [110]:
print('Top-5 longest tags')
%timeit tags.longest(5)
longest_tags = tags.longest(5)
print('Analyzing longest tags by character count shows user description detail level.')
for tag in longest_tags:
    print(f"'{tag}'")


Top-5 longest tags
651 μs ± 33.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Analyzing longest tags by character count shows user description detail level.
'Something for everyone in this one... saw it without and plan on seeing it with kids!'
'the catholic church is the most corrupt organization in history'
'audience intelligence underestimated'
'Oscar (Best Music - Original Score)'
'assassin-in-training (scene)'


In [111]:
print('Intersection of most word-rich and longest tags (top-5)')
%timeit tags.most_words_and_longest(5)
intersection = tags.most_words_and_longest(5)
print('Finding tags that are both word-rich and long may indicate especially detailed descriptions.')
for tag in intersection:
    print(f"'{tag}'")

Intersection of most word-rich and longest tags (top-5)
1.44 ms ± 22.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Finding tags that are both word-rich and long may indicate especially detailed descriptions.
'Oscar (Best Music - Original Score)'
'Something for everyone in this one... saw it without and plan on seeing it with kids!'
'the catholic church is the most corrupt organization in history'
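
The intersection returns only 3 tags even though 5 were requested, because it keeps just the tags that appear in both top-5 lists. A minimal sketch of that idea follows. The helper top_n and this version of most_words_and_longest are hypothetical re-implementations, not the code of the Tags class; the sample tags are copied from the outputs above.

```python
def top_n(tags, n, key):
    """Top-n unique tags under `key`, largest first (ties in alphabetical order)."""
    return sorted(sorted(set(tags)), key=key, reverse=True)[:n]

def most_words_and_longest(tags, n):
    """Tags that are in both the top-n by word count and the top-n by length."""
    wordy = top_n(tags, n, key=lambda tag: len(tag.split()))
    longest = top_n(tags, n, key=len)
    return sorted(set(wordy) & set(longest))

sample_tags = [
    "Something for everyone in this one... saw it without and plan on seeing it with kids!",
    "the catholic church is the most corrupt organization in history",
    "Oscar (Best Music - Original Score)",
    "Everything you want is here",
    "based on a true story",
    "audience intelligence underestimated",
    "assassin-in-training (scene)",
]
for tag in most_words_and_longest(sample_tags, 5):
    print(f"'{tag}'")
```

Tags such as 'audience intelligence underestimated' are long but contain few words, so they drop out of the intersection.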


In [112]:
print('Top-5 most popular tags')
%timeit tags.most_popular(5)
popular = tags.most_popular(5)
print('Examining most frequently used tags helps understand popular movie aspects.')
for tag, count in popular.items():
    print(f"'{tag}' - {count} times")

Top-5 most popular tags
637 μs ± 14.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Examining most frequently used tags helps understand popular movie aspects.
'funny' - 15 times
'sci-fi' - 14 times
'dark comedy' - 12 times
'twist ending' - 12 times
'action' - 10 times


In [113]:
print('All tags containing word "sci"')
%timeit tags.tags_with('sci')
sci_tags = tags.tags_with('sci')
print('Finding all tags containing "sci" to understand how users describe science fiction.')
for tag in sci_tags:
    print(f"'{tag}'")

All tags containing word "sci"
191 μs ± 5.4 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Finding all tags containing "sci" to understand how users describe science fiction.
'Sci-Fi'
'classic sci-fi'
'mad scientist'
'sci-fi'
'science fiction'
'scifi'
'scifi cult'
'sexy female scientist'
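
The output above contains both 'Sci-Fi' and 'sci-fi', and matches inside longer words such as 'scientist', which suggests a case-insensitive substring search over unique tags. A sketch of that behaviour, under that assumption, could look like this; the standalone function tags_with below is hypothetical and only mimics the method's apparent result.

```python
def tags_with(tags, word):
    """Unique tags containing `word` as a substring, ignoring case, sorted."""
    needle = word.lower()
    return sorted({tag for tag in tags if needle in tag.lower()})

# Made-up sample with duplicates and mixed case.
sample_tags = ["sci-fi", "Sci-Fi", "funny", "mad scientist", "sci-fi", "scifi"]
for tag in tags_with(sample_tags, "sci"):
    print(f"'{tag}'")
```

Python's default string sort places uppercase letters before lowercase ones, which matches 'Sci-Fi' appearing first in the output.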


In [114]:
print('Top-3 tags for movie ID 60756')
%timeit tags.most_popular_by_movie(60756, 3)
popular_for_movie = tags.most_popular_by_movie(60756, 3)
print('Analyzing tags for specific movies gives insight into user perception.')
for tag, count in popular_for_movie.items():
    print(f"'{tag}' - {count} times")

Top-3 tags for movie ID 60756
196 μs ± 10 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Analyzing tags for specific movies gives insight into user perception.
'funny' - 3 times
'will ferrell' - 3 times
'Highly quotable' - 1 times


In [115]:
print('Average tag length')
%timeit tags.average_tag_length()
avg_len = tags.average_tag_length()
print('Average tag length shows overall user description detail level.')
print(f'{avg_len:.2f} characters')

Average tag length
98.7 μs ± 4.36 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Average tag length shows overall user description detail level.
10.71 characters


In [116]:
print('Total unique tags count')
print('Shows diversity of tags used by users.')
%timeit tags.get_unique_tags_count()
unique_count = tags.get_unique_tags_count()
print(f'{unique_count} tags')

Total unique tags count
Shows diversity of tags used by users.
127 μs ± 3.99 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
602 tags


Finish!