# Web scraping

I scraped the movie information from Wikipedia. For example, for the MCU I used the Wikipedia article listing its films to scrape the movie names, release dates, and the names of the directors and producers:

In [1]:
import pandas as pd

# pd.read_html returns every table on the Wikipedia page whose text matches 'U.S. release date'; here it finds 4
scrape = pd.read_html('https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films', match='U.S. release date', header=0)
len(scrape)

4
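
Before choosing which of the returned tables to keep, it can help to list the shape and column names of each one. The sketch below uses two small hand-made frames in place of `scrape` so that it runs without network access; the helper name `summarize_tables` and the toy frames are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def summarize_tables(tables):
    """Return one row per table: its position, row count, and column names."""
    return pd.DataFrame(
        {
            "table": list(range(len(tables))),
            "rows": [len(t) for t in tables],
            "columns": [", ".join(map(str, t.columns)) for t in tables],
        }
    )


# Hypothetical stand-ins for the list returned by pd.read_html
tables = [
    pd.DataFrame({"Film": ["Iron Man"], "U.S. release date": ["May 2, 2008"]}),
    pd.DataFrame({"Film": ["Black Widow"], "Status": ["Released"]}),
]

summary = summarize_tables(tables)
print(summary)
```

Calling `summarize_tables(scrape)` on the real list would show at a glance which tables share the same columns.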

Next, we combine the tables at indices 0 and 1 into a single table.

In [58]:
data1 = scrape[0]
data1.head(5)

Unnamed: 0,Film,U.S. release date,Director(s),Screenwriter(s),Producer(s),Unnamed: 5
0,Phase One[24],Phase One[24],Phase One[24],Phase One[24],Phase One[24],Phase One[24]
1,Iron Man,"May 2, 2008",Jon Favreau[27],Mark Fergus & Hawk Ostby and Art Marcum & Matt...,Avi Arad and Kevin Feige,
2,The Incredible Hulk,"June 13, 2008",Louis Leterrier[29],Zak Penn[30],"Avi Arad, Gale Anne Hurdand Kevin Feige",
3,Iron Man 2,"May 7, 2010",Jon Favreau[31],Justin Theroux[32],Kevin Feige,
4,Thor,"May 6, 2011",Kenneth Branagh[33],Ashley Edward Miller & Zack Stentz and Don Pay...,Kevin Feige,


In [3]:
data2 = scrape[1]
data2.head(5)

Unnamed: 0,Film,U.S. release date,Director(s),Screenwriter(s),Producer(s),Status
0,Phase Four[69],Phase Four[69],Phase Four[69],Phase Four[69],Phase Four[69],Phase Four[69]
1,Black Widow,"July 9, 2021[b]",Cate Shortland[72],Eric Pearson[73],Kevin Feige,Released
2,Shang-Chi and the Legend of the Ten Rings,"September 3, 2021",Destin Daniel Cretton[74],Dave Callaham & Destin Daniel Cretton & Andrew...,Kevin Feige andJonathan Schwartz,Released
3,Eternals,"November 5, 2021",Chloé Zhao[76],Chloé Zhao and Chloé Zhao & Patrick Burleighan...,Kevin Feigeand Nate Moore,Released
4,Spider-Man: No Way Home,"December 17, 2021",Jon Watts[79],Chris McKenna & Erik Sommers[80],Kevin Feigeand Amy Pascal,Released


In [59]:
# DataFrame.append was deprecated and then removed in pandas 2.0; pd.concat is the supported replacement
data_12 = pd.concat([data1, data2], ignore_index=True)
data_12


Unnamed: 0,Film,U.S. release date,Director(s),Screenwriter(s),Producer(s),Unnamed: 5,Status
0,Phase One[24],Phase One[24],Phase One[24],Phase One[24],Phase One[24],Phase One[24],
1,Iron Man,"May 2, 2008",Jon Favreau[27],Mark Fergus & Hawk Ostby and Art Marcum & Matt...,Avi Arad and Kevin Feige,,
2,The Incredible Hulk,"June 13, 2008",Louis Leterrier[29],Zak Penn[30],"Avi Arad, Gale Anne Hurdand Kevin Feige",,
3,Iron Man 2,"May 7, 2010",Jon Favreau[31],Justin Theroux[32],Kevin Feige,,
4,Thor,"May 6, 2011",Kenneth Branagh[33],Ashley Edward Miller & Zack Stentz and Don Pay...,Kevin Feige,,
5,Captain America: The First Avenger,"July 22, 2011",Joe Johnston[35],Christopher Markus & Stephen McFeely[36],Kevin Feige,,
6,Marvel's The Avengers,"May 4, 2012",Joss Whedon[37],Joss Whedon[37],Kevin Feige,,
7,Phase Two[24],Phase Two[24],Phase Two[24],Phase Two[24],Phase Two[24],Phase Two[24],
8,Iron Man 3,"May 3, 2013",Shane Black[38],Drew Pearce and Shane Black[38][39],Kevin Feige,,
9,Thor: The Dark World,"November 8, 2013",Alan Taylor[40],Christopher L. Yost and Christopher Markus & S...,Kevin Feige,,
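
When two tables are stacked row-wise, a column that exists in only one of them (here `Status`) is kept, and the rows coming from the other table get missing values in it. A minimal sketch with toy frames, using `pd.concat`; the frames `first` and `second` are illustrative stand-ins with values abridged from the output above.

```python
import pandas as pd

# Toy frames mimicking the two scraped tables
first = pd.DataFrame(
    {
        "Film": ["Iron Man", "Thor"],
        "U.S. release date": ["May 2, 2008", "May 6, 2011"],
    }
)
second = pd.DataFrame(
    {
        "Film": ["Black Widow"],
        "U.S. release date": ["July 9, 2021[b]"],
        "Status": ["Released"],
    }
)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0
combined = pd.concat([first, second], ignore_index=True)

print(combined.columns.tolist())
print(combined["Status"].isna().tolist())
```

The union of the columns is kept in order of first appearance, which is why `Status` ends up as the last column.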


# Cleaning the data

Now it's time to prepare the data for analysis by cleaning it. 

In [60]:
# Remove the phase-header rows (e.g. 'Phase One[24]').
# .copy() makes the result an independent DataFrame, so the assignments below do not raise SettingWithCopyWarning
data_12_proc = data_12[~data_12['Film'].str.contains('Phase', na=True)].copy()
# Remove footnote markers such as [1], [2], [10], [a] from the columns
data_12_proc[['U.S. release date', 'Director(s)']] = data_12_proc[['U.S. release date', 'Director(s)']].replace(to_replace=r"\[[0-9a-zA-Z]*\]", value='', regex=True)
# In the final dataframe, only use four columns - 'Film', 'U.S. release date', 'Director(s)', 'Producer(s)'
data_12_proc = data_12_proc[['Film', 'U.S. release date', 'Director(s)', 'Producer(s)']].copy()
"""
Next, the separators between names in two columns - 'Director(s)' and 'Producer(s)' - are highly inconsistent.
They are written in the following ways (pay attention to the spaces):
 - "name1, name2"
 - "name1,name2"
 - "name1 ,name2"
 - "name1 & name2"
 - "name1 and name2", "name1and name2", "name1 andname2"
The following step will homogenise these to the same standard way of writing: "name1, name2"
"""
# Match a comma, '&', or 'and' together with any surrounding spaces. 'and' only counts as a separator
# when it is a whole word or is fused to one neighbouring name (as in 'Hurdand Kevin' or 'andJonathan'),
# so names that merely contain 'and' (e.g. 'Cate Shortland') are left intact.
# Heuristic: a name ending in 'and' directly followed by another capitalised word would still be split.
separators = r"\s*(?:,|&|\band\b|(?<=[a-z])and(?=\s+[A-Z])|\band(?=[A-Z]))\s*"
data_12_proc[['Director(s)', 'Producer(s)']] = data_12_proc[['Director(s)', 'Producer(s)']].replace(separators, ', ', regex=True)
# Convert one column to datetime format and filter rows by being earlier than the present date
data_12_proc['U.S. release date'] = pd.to_datetime(data_12_proc['U.S. release date'])
data_12_proc[data_12_proc['U.S. release date'] < pd.to_datetime('today')]


Unnamed: 0,Film,U.S. release date,Director(s),Producer(s)
1,Iron Man,2008-05-02,Jon Favreau,"Avi Arad, Kevin Feige"
2,The Incredible Hulk,2008-06-13,Louis Leterrier,"Avi Arad, Gale Anne Hurd, Kevin Feige"
3,Iron Man 2,2010-05-07,Jon Favreau,Kevin Feige
4,Thor,2011-05-06,Kenneth Branagh,Kevin Feige
5,Captain America: The First Avenger,2011-07-22,Joe Johnston,Kevin Feige
6,Marvel's The Avengers,2012-05-04,Joss Whedon,Kevin Feige
8,Iron Man 3,2013-05-03,Shane Black,Kevin Feige
9,Thor: The Dark World,2013-11-08,Alan Taylor,Kevin Feige
10,Captain America: The Winter Soldier,2014-04-04,"Anthony, Joe Russo",Kevin Feige
11,Guardians of the Galaxy,2014-08-01,James Gunn,Kevin Feige