# Extracting data using Cinemagoer

__Authors:__

|             Names              |       Email        |
|--------------------------------|--------------------|
|Laura Vanessa Lopez Serrano     | llopezs6@uc.cl     |
|Luis Antonio Aguilar Gutiérrez  | laaguilarg@uc.cl   |

## Intro

In this notebook, we extract the cast for a list of movies from IMDb. First, we install the latest version of [Cinemagoer](https://github.com/cinemagoer/cinemagoer) directly from its GitHub repository:

```bash
pip install git+https://github.com/cinemagoer/cinemagoer
```
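
To confirm that the installation worked before running the rest of the notebook, a quick check along these lines can help. It is a minimal sketch: the helper name `is_installed` is ours, and it only relies on the fact that Cinemagoer is imported as `imdb` later in this notebook.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Cinemagoer is imported as `imdb` (see the "Data extraction" section).
print("imdb importable:", is_installed("imdb"))
```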


## Data loading

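We first download and extract the MovieLens 100K dataset (`ml-100k`), then load the item data with `load_data` from the local `data_operations` module. The code in this section comes from the tutorial.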

In [1]:
!pip install wget
!pip install zipfile36
!python -m wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

100% [......................................................] 4924029 / 4924029
Saved under ml-100k.zip


In [3]:
import zipfile
with zipfile.ZipFile("ml-100k.zip", 'r') as zip_ref:
    zip_ref.extractall(".")

In [1]:
from data_operations import load_data

In [4]:
df_items = load_data()

In [6]:
df_items.head()

|   | itemid | title | release_date | video_release_date | IMDb_URL | unknown | genres |
|---|--------|-------|--------------|--------------------|----------|---------|--------|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | Animation\|Children\|Comedy |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | Action\|Adventure\|Thriller |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | Thriller |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | Action\|Comedy\|Drama |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | Crime\|Drama\|Thriller |
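
The `load_data` helper lives in the local `data_operations` module and is not shown in this notebook. As a rough sketch of what such a loader might do, the following parses one line of `ml-100k/u.item` (pipe-separated, Latin-1 encoded) and collapses the one-hot genre flags into the `genres` string shown above. The names `GENRES`, `parse_item_line`, and `load_items` are hypothetical, the sample line is built by hand to mirror the Copycat row, and the real `load_data` may differ.

```python
# Genre flags that follow the `unknown` flag in u.item, in file order.
# The labels follow the table above (e.g. "Children"); this is an assumption.
GENRES = [
    "Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def parse_item_line(line):
    """Parse one pipe-separated u.item line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("|")
    itemid, title, release_date, video_release_date, imdb_url, unknown = fields[:6]
    flags = fields[6:6 + len(GENRES)]
    # Keep the names of the genres whose flag is set, joined with "|".
    genres = "|".join(name for name, flag in zip(GENRES, flags) if flag == "1")
    return {
        "itemid": int(itemid),
        "title": title,
        "release_date": release_date,
        "video_release_date": video_release_date,
        "IMDb_URL": imdb_url,
        "unknown": int(unknown),
        "genres": genres,
    }


def load_items(path="ml-100k/u.item"):
    """Read every non-empty line of u.item (the file is Latin-1 encoded)."""
    with open(path, encoding="latin-1") as handle:
        return [parse_item_line(line) for line in handle if line.strip()]


# Hand-built sample line mirroring the Copycat row shown above.
SAMPLE_LINE = (
    "5|Copycat (1995)|01-Jan-1995||"
    "http://us.imdb.com/M/title-exact?Copycat%20(1995)|"
    "0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0"
)
print(parse_item_line(SAMPLE_LINE)["genres"])
```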


## Data processing

In [10]:
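def rename_movies(x):
    """Move a trailing article to the front: 'Title, The (1995)' -> 'The Title (1995)'."""
    if ',' in x:
        title_split = x.split(',')
        # ' The (1995)' -> ['', 'The', '(1995)']
        title_first = title_split[1].split(' ')
        title_second = title_split[0]
        # article + main title + year
        title = title_first[1] + ' ' + title_second + ' ' + title_first[-1]
        return title
    else:
        return x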
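
The function above splits on every comma and only uses the first two pieces, so a title with more than one comma, or a comma that is part of the name, may be rearranged incorrectly. A possible alternative is sketched here: it only moves the last comma-separated word when it is a known article. The name `rename_movies_robust`, the `ARTICLES` whitelist, and the example titles are our assumptions, not part of the original notebook.

```python
import re

# Articles we are willing to move to the front (an assumed, incomplete list).
ARTICLES = {"The", "A", "An", "Il", "La", "Le", "Les", "El", "Los", "Das", "Der", "Die"}

# "<body>, <article> (<year>)" where <article> is a single word.
TRAILING_ARTICLE = re.compile(
    r"^(?P<body>.+), (?P<article>[^\s,()]+) \((?P<year>\d{4})\)$"
)


def rename_movies_robust(title):
    """Return 'Article Body (Year)' when the title ends in ', Article (Year)'."""
    match = TRAILING_ARTICLE.match(title.strip())
    if match is None or match["article"] not in ARTICLES:
        # No trailing article we recognise: leave the title untouched.
        return title
    return f"{match['article']} {match['body']} ({match['year']})"


for example in [
    "Usual Suspects, The (1995)",
    "Englishman Who Went Up a Hill, But Came Down a Mountain, The (1995)",
    "Paris, Texas (1984)",
    "Toy Story (1995)",
]:
    print(rename_movies_robust(example))
```

It can be applied in the same way as the original, for example `df_items.title.apply(rename_movies_robust)`.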

In [11]:
df_items['title_query'] = df_items.title.apply(rename_movies)

In [12]:
df_items.head()

|   | itemid | title | release_date | video_release_date | IMDb_URL | unknown | genres | title_query |
|---|--------|-------|--------------|--------------------|----------|---------|--------|-------------|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | Animation\|Children\|Comedy | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | Action\|Adventure\|Thriller | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | Thriller | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | Action\|Comedy\|Drama | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | Crime\|Drama\|Thriller | Copycat (1995) |


## Data extraction

In [20]:
from tqdm import tqdm
from imdb import Cinemagoer
import pandas as pd
import warnings
#warnings.filterwarnings("ignore")

In [27]:
exclude = ['unknown']
query_movies = df_items.groupby('title_query')['genres'].count().sort_values(ascending = False).reset_index(name = 'count')
query_movies.head()

|   | title_query | count |
|---|-------------|-------|
| 0 | The Designated Mourner (1997) | 2 |
| 1 | Body Snatchers (1993) | 2 |
| 2 | Chasing Amy (1997) | 2 |
| 3 | Sliding Doors (1998) | 2 |
| 4 | Desperate Measures (1998) | 2 |


In [28]:
actors = []
movies_list = []
not_access = []
ia = Cinemagoer()
for row in tqdm(query_movies.itertuples()):
    try:
        if row.title_query not in exclude:
            # get movie id
            with warnings.catch_warnings():
                warnings.simplefilter('ignore')
                movie_id = ia.search_movie(row.title_query)[0].movieID
                # get movie data
                movie = ia.get_movie(movie_id, info = ['main'])
            # append actors
            actors.append(movie.get('stars'))
            movies_list.append(row.title_query)
        else:
            actors.append(['Unknow'])
            movies_list.append(row.title_query)
    except Exception:
        not_access.append(row.title_query)
        actors.append(['Unknow'])
        movies_list.append(row.title_query)

59it [03:42,  4.19s/it]
2024-04-28 17:27:39,370 CRITICAL [imdbpy] /home/antonio/anaconda3/envs/res/lib/python3.9/site-packages/imdb/_exceptions.py:32: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'https://www.imdb.com/find/?q=Somewhere+in+Time+%281980%29&s=tt', 'proxy': '', 'exception type': 'IOError', 'original exception': <HTTPError 405: 'Not Allowed'>},); kwds: {}
Traceback (most recent call last):
  File "/home/antonio/anaconda3/envs/res/lib/python3.9/site-packages/imdb/parser/http/__init__.py", line 233, in retrieve_unicode
    response = uopener.open(url)
  File "/home/antonio/anaconda3/envs/res/lib/python3.9/urllib/request.py", line 523, in open
    response = meth(req, response)
  File "/home/antonio/anaconda3/envs/res/lib/python3.9/urllib/request.py", line 632, in http_response
    response = self.parent.error(
  File "/home/antonio/anaconda3/envs/res/lib/python3.9/urllib/request.py", line 561, in error
    return self._call_chain(*arg
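
The log above shows IMDb answering a search with `HTTP 405`, after which the loop records the title in `not_access`. One option, sketched here under the assumption that some of these failures are transient, is to retry a call a few times with exponential backoff before giving up. `call_with_retry` is a hypothetical helper, not part of Cinemagoer, and retrying will not help if IMDb is rejecting the requests outright.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on Exception with exponential backoff.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    attempts and re-raises the last error once `retries` retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Small demonstration with a stand-in function that fails twice, then succeeds.
calls = {"count": 0}
delays = []


def flaky_lookup(title):
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("simulated temporary failure")
    return f"result for {title}"


# `sleep` is replaced so the demonstration records the delays instead of waiting.
result = call_with_retry(flaky_lookup, "Somewhere in Time (1980)", sleep=delays.append)
print(result, delays)
```

In the extraction loop, the helper could wrap the two network calls, for example `call_with_retry(ia.search_movie, row.title_query)`.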

In [35]:
# check total movies list and actor list
print(len(movies_list), len(actors))

1664 1664


Next, we combine both lists into a pandas `DataFrame`:

In [36]:
actors_movies = pd.DataFrame()
actors_movies['movies'] = movies_list
actors_movies['actors'] = actors
# replace None with Unknow
actors_movies.fillna(value = 'Unknow', inplace = True)

In [37]:
actors_movies.head()

|   | movies | actors |
|---|--------|--------|
| 0 | The Designated Mourner (1997) | [Mike Nichols, Miranda Richardson, David de Ke... |
| 1 | Body Snatchers (1993) | [Terry Kinney, Meg Tilly, Gabrielle Anwar] |
| 2 | Chasing Amy (1997) | [Ethan Suplee, Ben Affleck, Scott Mosier] |
| 3 | Sliding Doors (1998) | [Gwyneth Paltrow, John Hannah, John Lynch] |
| 4 | Desperate Measures (1998) | [Michael Keaton, Andy Garcia, Brian Cox] |


Next, we add the movie ID. We rename the titles in `df_items` the same way and `merge` the result with `actors_movies` on the title, which brings in the `itemid` column:

In [39]:
movie_items = df_items[['itemid', 'title']].copy()
movie_items['title'] = movie_items.title.apply(rename_movies)
movie_items.rename(columns = {'title': 'movies'}, inplace = True)
movie_items = movie_items.merge(actors_movies, how='inner', on='movies')
movie_items.head()

|   | itemid | movies | actors |
|---|--------|--------|--------|
| 0 | 1 | Toy Story (1995) | [Tom Hanks, Tim Allen, Don Rickles] |
| 1 | 2 | GoldenEye (1995) | [Pierce Brosnan, Sean Bean, Izabella Scorupco] |
| 2 | 3 | Four Rooms (1995) | [Sammi Davis, Amanda De Cadenet, Valeria Golino] |
| 3 | 4 | Get Shorty (1995) | [John Travolta, Gene Hackman, Rene Russo] |
| 4 | 5 | Copycat (1995) | [Sigourney Weaver, Holly Hunter, Dermot Mulroney] |


Finally, we save the result to a CSV file:

In [40]:
movie_items.to_csv('./movie_items.csv', sep = ',', index = False)
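
Because the `actors` column holds Python lists, `to_csv` writes each one as its string representation. A possible way to turn those cells back into lists when reading the file, assuming the cast entries were stored as plain strings, is `ast.literal_eval`. The helper names and the second sample row (item id 9999) are hypothetical; the fallback covers cells such as the bare `Unknow` placeholder that are not valid Python literals.

```python
import ast
import csv
import io


def parse_actors(cell):
    """Turn a CSV cell like "['A', 'B']" back into a list of names."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python literal (e.g. the bare placeholder): wrap it in a list.
        return [cell]
    if isinstance(value, (list, tuple)):
        return list(value)
    return [str(value)]


def read_movie_items(handle):
    """Read rows from an open CSV file and parse the `actors` column."""
    rows = []
    for row in csv.DictReader(handle):
        row["actors"] = parse_actors(row["actors"])
        rows.append(row)
    return rows


# In-memory sample; the second row is made up to show the placeholder case.
SAMPLE_CSV = (
    "itemid,movies,actors\n"
    "1,Toy Story (1995),\"['Tom Hanks', 'Tim Allen', 'Don Rickles']\"\n"
    "9999,Some Missing Title (1990),Unknow\n"
)
rows = read_movie_items(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["movies"], row["actors"])
```

With the real file, the same function could be called as `read_movie_items(open('./movie_items.csv', newline=''))`.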