# IMDb Database

**You are part of the Data team** 🤓 at a startup in its first funding stage that sells collectible merchandise for the movies 🎬 and TV series 📺 that appear in the **top rankings** 🏆 in the U.S.

For now, **the API** 🤖 used to extract data on IMDb's top movies **is still down** because of [the AWS outage](https://www.theguardian.com/technology/2021/dec/07/amazon-web-services-outage-hits-sites-and-apps-such-as-imdb-and-tinder) 😓 of December 7, 2021, so **an alternative needs to be set up**.

**As lead** of the intern team, **you have decided to obtain** the list of target movies **by scraping** 🦾 the top-chart pages of the **IMDb** website 🎬.

# Instructions

Share the lists of movies and shows in the following categories with the production team by generating a CSV file:

- [Top 10 Most Popular Movies](https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm)
- [Top 10 movies from 2021 to 2018](https://www.imdb.com/chart/top?sort=us,desc&mode=simple&page=1)
- [Top 10 TV Shows](https://www.imdb.com/chart/toptv/?sort=us,desc&mode=simple&page=1)

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
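# Download the Most Popular Movies chart and parse the HTML
r = requests.get('https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm')
r.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly

pelis = []

# Each title sits in a <td class="titleColumn"> cell; keep the first 10
for peli in soup.find_all('td', attrs={'class': 'titleColumn'})[:10]:
    pelis.append(peli.find('a').text)

print(pelis)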

['Day Shift', 'Prey', 'Bullet Train', 'Nope', 'Top Gun: Maverick', 'The Menu', 'Elvis', 'Thirteen Lives', 'The Gray Man', 'Orphan: First Kill']


In [3]:
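movies = []

# Walk the table rows instead of the title cells to collect title and year.
# The first <tr> is the header row and has no title cell, so the first
# entry comes back empty and only 9 movies are collected.
for peli in soup.find_all('tr')[:10]:
    movie = []
    children = peli.find_all('td', attrs={'class': 'titleColumn'})
    for child in children:
        title = child.find('a').text
        year = child.find('span', attrs={'class': 'secondaryInfo'}).text
        movie.append(title)
        movie.append(year)
    movies.append(movie)

movies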

[[],
 ['Day Shift', '(2022)'],
 ['Prey', '(2022)'],
 ['Bullet Train', '(2022)'],
 ['Nope', '(2022)'],
 ['Top Gun: Maverick', '(2022)'],
 ['The Menu', '(2022)'],
 ['Elvis', '(2022)'],
 ['Thirteen Lives', '(2022)'],
 ['The Gray Man', '(2022)']]

In [29]:
# Top 10 Most Popular Movies

r = requests.get('https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm')
r.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(r.text, 'html.parser')

movies = []

# Row 0 is the table header, so `index` doubles as the chart position.
# Note that the [:10] slice therefore covers at most 9 movies.
for index, peli in enumerate(soup.find_all('tr')[:10]):
    movie = []
    # Skip the header row and any title that has no rating yet
    if peli.find('a') is None or peli.find('strong') is None:
        continue
    movie.append(peli.find('td', attrs={'class': 'titleColumn'}).find('a').text)  # title
    movie.append(peli.find('span', attrs={'class': 'secondaryInfo'}).text)  # year, e.g. '(2022)'
    movie.append(peli.find('strong').text)  # IMDb rating
    movie.append(f'Top {index} Most Popular Movie')  # rank label
    movies.append(movie)

movies_df = pd.DataFrame(movies, columns=['Movie', 'Year', 'IMDb Rate', 'Description'])
# Drop the parentheses around the year: '(2022)' -> '2022'
movies_df['Year'] = movies_df['Year'].str.strip('()')
df1 = movies_df
df1

| | Movie | Year | IMDb Rate | Description |
|---|---|---|---|---|
| 0 | Day Shift | 2022 | 6.1 | Top 1 Most Popular Movie |
| 1 | Prey | 2022 | 7.2 | Top 2 Most Popular Movie |
| 2 | Bullet Train | 2022 | 7.5 | Top 3 Most Popular Movie |
| 3 | Nope | 2022 | 7.3 | Top 4 Most Popular Movie |
| 4 | Top Gun: Maverick | 2022 | 8.5 | Top 5 Most Popular Movie |
| 5 | Elvis | 2022 | 7.6 | Top 7 Most Popular Movie |
| 6 | Thirteen Lives | 2022 | 7.8 | Top 8 Most Popular Movie |
| 7 | The Gray Man | 2022 | 6.5 | Top 9 Most Popular Movie |
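
The rank labels above jump from Top 5 to Top 7, and only eight rows come back: the header row takes one of the ten slots, and "The Menu" (sixth in the earlier title list) is dropped, presumably because it had no rating yet. The following is a minimal sketch of labeling logic that keeps unrated titles, assuming the rows have already been parsed into `(title, year_text, rating)` tuples. `label_rows` and the `sample` list are hypothetical; the sample values are copied from the outputs above, with `None` standing in for the missing rating.

```python
def label_rows(rows, kind, limit=10):
    """Build [title, year, rating, description] records for the first `limit` rows.

    `rows` holds (title, year_text, rating) tuples; rating is None when the
    chart shows no score. Unrated titles are kept so the ranks stay contiguous.
    """
    records = []
    # start=1 makes the rank independent of any header row in the source table
    for rank, (title, year_text, rating) in enumerate(rows[:limit], start=1):
        year = year_text.strip('()')  # '(2022)' -> '2022'
        records.append([title, year, rating, f'Top {rank} {kind}'])
    return records


# Hypothetical sample built from the outputs above; None marks the unrated title
sample = [
    ('Day Shift', '(2022)', '6.1'),
    ('Prey', '(2022)', '7.2'),
    ('Bullet Train', '(2022)', '7.5'),
    ('Nope', '(2022)', '7.3'),
    ('Top Gun: Maverick', '(2022)', '8.5'),
    ('The Menu', '(2022)', None),
    ('Elvis', '(2022)', '7.6'),
]

for record in label_rows(sample, 'Most Popular Movie'):
    print(record)
```

Keeping the unrated row means the production team still sees the title; whether a missing rating should be exported or filtered out is a choice the notebook itself does not state.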


In [22]:
# Function for automating the web scraping of the IMDb website
def imdb(url, kind):
    """
    Obtain the title, year and IMDb rating for the top entries of an IMDb chart page.
    A 'Description' column records the rank and the kind of element scraped.
    """
    r = requests.get(url)
    r.raise_for_status()  # stop early on an HTTP error
    soup = BeautifulSoup(r.text, 'html.parser')

    movies = []

    # Row 0 is the table header, so `index` doubles as the chart position.
    # Note that the [:10] slice therefore covers at most 9 entries.
    for index, peli in enumerate(soup.find_all('tr')[:10]):
        movie = []
        # Skip the header row and any entry that has no rating yet
        if peli.find('a') is None or peli.find('strong') is None:
            continue
        title_cell = peli.find('td', attrs={'class': 'titleColumn'})
        movie.append(title_cell.find('a').text)  # title
        movie.append(title_cell.find('span', attrs={'class': 'secondaryInfo'}).text)  # year
        movie.append(peli.find('strong').text)  # IMDb rating
        movie.append(f'Top {index} {kind}')  # rank label
        movies.append(movie)

    movies_df = pd.DataFrame(movies, columns=['Movie', 'Year', 'IMDb Rate', 'Description'])
    # Drop the parentheses around the year: '(2022)' -> '2022'
    movies_df['Year'] = movies_df['Year'].str.strip('()')

    return movies_df

In [23]:
df2 = imdb('https://www.imdb.com/chart/top?sort=us,desc&mode=simple&page=1', 'Movie')
df2

| | Movie | Year | IMDb Rate | Description |
|---|---|---|---|---|
| 0 | Top Gun: Maverick | 2022 | 8.4 | Top 1 Movie |
| 1 | Everything Everywhere All at Once | 2022 | 8.1 | Top 2 Movie |
| 2 | Spider-Man: No Way Home | 2021 | 8.2 | Top 3 Movie |
| 3 | Jai Bhim | 2021 | 8.0 | Top 4 Movie |
| 4 | Hamilton | 2020 | 8.2 | Top 5 Movie |
| 5 | The Father | 2020 | 8.2 | Top 6 Movie |
| 6 | 1917 | 2019 | 8.2 | Top 7 Movie |
| 7 | Klaus | 2019 | 8.1 | Top 8 Movie |
| 8 | Joker | 2019 | 8.3 | Top 9 Movie |
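
The `Year` values shown above start out as scraped text such as `(2019)` before the parentheses are removed. A regular expression can do the same cleanup while also flagging text that does not look like a year. This is a minimal sketch using only the standard library; `extract_year` is a hypothetical helper, and the `(YYYY)` format is an assumption based on the outputs in this notebook.

```python
import re

# Matches a four-digit year wrapped in parentheses, e.g. "(2022)"
YEAR_PATTERN = re.compile(r'\((\d{4})\)')


def extract_year(text):
    """Return the year as an int, or None if no '(YYYY)' pattern is found."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None


print(extract_year('(2019)'))   # 2019
print(extract_year('unknown'))  # None
```

Returning `None` instead of the raw text makes malformed rows easy to spot before the CSV is shared.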


In [24]:
df3 = imdb('https://www.imdb.com/chart/toptv/?sort=us,desc&mode=simple&page=1', 'TV Show')
df3

| | Movie | Year | IMDb Rate | Description |
|---|---|---|---|---|
| 0 | The Offer | 2022 | 8.5 | Top 1 TV Show |
| 1 | Heartstopper | 2022 | 8.6 | Top 2 TV Show |
| 2 | SPY×FAMILY | 2022 | 8.4 | Top 3 TV Show |
| 3 | Severance | 2022 | 8.6 | Top 4 TV Show |
| 4 | Rocket Boys | 2022 | 8.4 | Top 5 TV Show |
| 5 | 1883 | 2021 | 8.6 | Top 6 TV Show |
| 6 | The Beatles: Get Back | 2021 | 8.9 | Top 7 TV Show |
| 7 | Arcane | 2021 | 8.9 | Top 8 TV Show |
| 8 | Dopesick | 2021 | 8.5 | Top 9 TV Show |


In [30]:
# Exporting the dataframes as CSV files
df1.to_csv('Top10MostPopularMovies.csv', index=False)
df2.to_csv('Top10Movies.csv', index=False)
df3.to_csv('Top10TVShows.csv', index=False)
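
The instructions ask for a CSV file, and the cell above writes three. If the production team prefers a single file, the exports could be merged afterwards. The following is a sketch that uses the standard library `csv` module rather than pandas; `merge_csv_files` and the output name `Top10All.csv` are hypothetical, and it assumes every input file is non-empty and shares the same header row.

```python
import csv


def merge_csv_files(paths, out_path):
    """Concatenate CSV files that share a header into one file.

    Returns the number of data rows written.
    """
    header = None
    rows = []
    for path in paths:
        with open(path, newline='', encoding='utf-8') as f:
            reader = csv.reader(f)
            file_header = next(reader)  # first line of each file is the header
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f'{path} has a different header: {file_header}')
            rows.extend(reader)  # remaining lines are data rows

    if header is None:
        raise ValueError('no input files given')

    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)  # write the shared header once
        writer.writerows(rows)
    return len(rows)
```

Run after the export cell, `merge_csv_files(['Top10MostPopularMovies.csv', 'Top10Movies.csv', 'Top10TVShows.csv'], 'Top10All.csv')` would write one combined file; the `Description` column already records which chart each row came from.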

In [None]:
# Done