### Web Scraping
### 1 - Dawn Articles
### 2 - Top Science Fiction Movies
### 3 - Web Scraping From Multiple Webpages
### 4 - Using API

---

## 1 - Dawn Articles

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
webpage = requests.get('https://www.dawn.com/').text

In [3]:
soup=BeautifulSoup(webpage,'html.parser')

In [4]:
data = soup.find_all('article',attrs={'class':'story'})

In [5]:
data[0]

<article class="story box mb-2 pb-1" data-id="1717321" data-layout="story">
<!-- box/label -->
<span class="badge inline-flex text-3 font-normal rounded theme theme-default theme--live align-middle" dir="auto" id="live"><span>Live </span></span>
<!-- box/title title-bold-playfairdisplay-pb:1-text:5 -->
<h2 class="story__title text-5 font-bold font-playfair-display leading-tight pb-1" data-id="1717321" data-layout="story" dir="auto"><a class="story__link" href="https://www.dawn.com/news/1717321/imran-calls-on-long-march-participants-to-reach-rawalpindi-on-nov-26">Imran calls on long march participants to reach Rawalpindi on Nov 26</a></h2>
<!-- box/image -->
<figure class="media media--fill sm:w-full w-full mb-1">
<div class="media__item"><a href="https://www.dawn.com/news/1717321/imran-calls-on-long-march-participants-to-reach-rawalpindi-on-nov-26" target="_self" title=""><picture><img alt="Imran calls on long march participants to reach Rawalpindi on Nov 26" src="https://i.dawn.com/me

In [6]:
anchor = data[0].find('a')

In [7]:
anchor

<a class="story__link" href="https://www.dawn.com/news/1717321/imran-calls-on-long-march-participants-to-reach-rawalpindi-on-nov-26">Imran calls on long march participants to reach Rawalpindi on Nov 26</a>

In [8]:
anchor.text

'Imran calls on long march participants to reach Rawalpindi on Nov 26'

In [9]:
anchor['href']

'https://www.dawn.com/news/1717321/imran-calls-on-long-march-participants-to-reach-rawalpindi-on-nov-26'

In [10]:
data[0].find('img')

<img alt="Imran calls on long march participants to reach Rawalpindi on Nov 26" src="https://i.dawn.com/medium/2022/11/19172041ef8891f.png?r=172059"/>

In [11]:
data[0].find('img')['src']

'https://i.dawn.com/medium/2022/11/19172041ef8891f.png?r=172059'

In [12]:
titles = []
paragraph = []
author_names = []
published_time = []
images = []
links = []

for article in data:
    anchor = article.find('a')
    img = article.find('img')
    titles.append(anchor.text)
    links.append(anchor['href'])
    # a single space marks articles that have no image
    images.append(img['src'] if img else ' ')

    # each optional field falls back to a placeholder when its tag is missing
    byline = article.find('span', 'story__byline')
    author_names.append(byline.a.text if byline else '')

    # class string kept exactly as written (note the double space)
    timestamp = article.find('span', class_ = 'timestamp--time  timeago')
    published_time.append(timestamp.text if timestamp else '**')

    excerpt = article.find('div', class_ = 'story__excerpt')
    paragraph.append(excerpt.text if excerpt else '')

In [13]:
len(titles),len(links), len(images)

(180, 180, 180)
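
The extraction loop repeats one pattern for every optional field: look up a tag and fall back to a placeholder when it is missing. The sketch below wraps that pattern in a helper. The markup is made up for illustration and `text_or` is a hypothetical name, not part of BeautifulSoup. It also shows one possible reason a multi-class lookup can come back empty: BeautifulSoup appears to compare a `class_` string containing whitespace against the tag's classes joined by single spaces, so a string with a double space would not match, while a CSS selector such as `span.timestamp--time.timeago` tests each class separately.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the article HTML shown earlier
html = '''
<article class="story">
  <h2 class="story__title"><a class="story__link" href="https://example.com/a">First</a></h2>
  <span class="timestamp--time timeago">2 hours ago</span>
</article>
<article class="story">
  <h2 class="story__title"><a class="story__link" href="https://example.com/b">Second</a></h2>
</article>
'''

def text_or(parent, selector, default=''):
    """Return the stripped text of the first CSS match under parent, else default."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else default

soup = BeautifulSoup(html, 'html.parser')

rows = [
    {
        'title': text_or(article, 'a.story__link'),
        'time': text_or(article, 'span.timestamp--time.timeago', default='**'),
    }
    for article in soup.select('article.story')
]

# Exact-string class lookup with a double space, as in the loop above
double_space_hit = soup.find('span', class_='timestamp--time  timeago')

print(rows)
print(double_space_hit)
```

Whether the real Dawn markup uses these class names is not confirmed here; the point is only that the selector form does not depend on spacing.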

In [14]:
dawn_df = pd.DataFrame({'Titles':titles,'Author Names':author_names,'Published Time':published_time,'First Paragraph':paragraph,'Links':links,'Images':images})

In [15]:
dawn_df.head()

Unnamed: 0,Titles,Author Names,Published Time,First Paragraph,Links,Images
0,Imran calls on long march participants to reac...,,**,"""I will meet you there,"" says the PTI chief.\n...",https://www.dawn.com/news/1717321/imran-calls-...,https://i.dawn.com/medium/2022/11/19172041ef88...
1,"Pakistan will not default, will make bond paym...",,**,Requests people to avoid spreading rumours or ...,https://www.dawn.com/news/1721879/pakistan-wil...,https://i.dawn.com/medium/2022/11/19164841e585...
2,PPP’s Chandio brothers indicted in Mehar tripl...,,**,Court summons witnesses in next hearing on Dec...,https://www.dawn.com/news/1721876/ppps-chandio...,https://i.dawn.com/medium/2022/11/191623231363...
3,"In report to IHC, capital police highlight ris...",,**,"Report suggests ""financial/bank guarantees"" be...",https://www.dawn.com/news/1721867/in-report-to...,https://i.dawn.com/medium/2022/11/191505364972...
4,Karachi police disperse protesting Islamia Col...,,**,Islami Jamiat Talaba claims five students were...,https://www.dawn.com/news/1721874/karachi-poli...,https://i.dawn.com/medium/2022/11/1915395818a6...


In [16]:
dawn_df.tail(3)

Unnamed: 0,Titles,Author Names,Published Time,First Paragraph,Links,Images
177,Poet's Corner\n,,**,,https://www.dawn.com/news/1721858/poets-corner,
178,Website review: All the related words\n,,**,,https://www.dawn.com/news/1721856/website-revi...,
179,Cook-it-yourself: Malt caffè mocha\n,,**,,https://www.dawn.com/news/1721855/cook-it-your...,


In [17]:
dawn_df.describe()

Unnamed: 0,Titles,Author Names,Published Time,First Paragraph,Links,Images
count,180.0,180.0,180,180.0,180,180.0
unique,105.0,20.0,1,35.0,141,71.0
top,,,**,,https://www.dawn.com/news/1721598/lky-on-not-c...,
freq,59.0,160.0,180,145.0,4,107.0
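
`describe()` reports 141 unique links among 180 rows, so some articles appear more than once on the page. A minimal sketch of order-preserving de-duplication by link follows; `dedupe_by_key` and the sample rows are hypothetical and only illustrate the idea.

```python
def dedupe_by_key(rows, key):
    """Keep the first row seen for each value of row[key], preserving order."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique_rows.append(row)
    return unique_rows

# Made-up rows standing in for scraped articles
sample_rows = [
    {'Titles': 'A', 'Links': 'https://example.com/1'},
    {'Titles': 'B', 'Links': 'https://example.com/2'},
    {'Titles': 'A (repeat)', 'Links': 'https://example.com/1'},
]

unique_rows = dedupe_by_key(sample_rows, 'Links')
print(len(sample_rows), '->', len(unique_rows))
```

On the DataFrame itself, `dawn_df.drop_duplicates(subset='Links')` should have the same effect.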


In [18]:
dawn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Titles           180 non-null    object
 1   Author Names     180 non-null    object
 2   Published Time   180 non-null    object
 3   First Paragraph  180 non-null    object
 4   Links            180 non-null    object
 5   Images           180 non-null    object
dtypes: object(6)
memory usage: 8.6+ KB


In [19]:
dawn_df.isnull()

Unnamed: 0,Titles,Author Names,Published Time,First Paragraph,Links,Images
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
175,False,False,False,False,False,False
176,False,False,False,False,False,False
177,False,False,False,False,False,False
178,False,False,False,False,False,False


In [20]:
import numpy as np
df = dawn_df.replace(r'^\s*$', np.nan, regex=True) # replace empty or whitespace-only cells with NaN
df

Unnamed: 0,Titles,Author Names,Published Time,First Paragraph,Links,Images
0,Imran calls on long march participants to reac...,,**,"""I will meet you there,"" says the PTI chief.\n...",https://www.dawn.com/news/1717321/imran-calls-...,https://i.dawn.com/medium/2022/11/19172041ef88...
1,"Pakistan will not default, will make bond paym...",,**,Requests people to avoid spreading rumours or ...,https://www.dawn.com/news/1721879/pakistan-wil...,https://i.dawn.com/medium/2022/11/19164841e585...
2,PPP’s Chandio brothers indicted in Mehar tripl...,,**,Court summons witnesses in next hearing on Dec...,https://www.dawn.com/news/1721876/ppps-chandio...,https://i.dawn.com/medium/2022/11/191623231363...
3,"In report to IHC, capital police highlight ris...",,**,"Report suggests ""financial/bank guarantees"" be...",https://www.dawn.com/news/1721867/in-report-to...,https://i.dawn.com/medium/2022/11/191505364972...
4,Karachi police disperse protesting Islamia Col...,,**,Islami Jamiat Talaba claims five students were...,https://www.dawn.com/news/1721874/karachi-poli...,https://i.dawn.com/medium/2022/11/1915395818a6...
...,...,...,...,...,...,...
175,FIFA World Cup Qatar 2022: The fun begins\n,,**,,https://www.dawn.com/news/1721861/fifa-world-c...,
176,Mailbox\n,,**,,https://www.dawn.com/news/1721860/mailbox,
177,Poet's Corner\n,,**,,https://www.dawn.com/news/1721858/poets-corner,
178,Website review: All the related words\n,,**,,https://www.dawn.com/news/1721856/website-revi...,
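
The pattern for blank cells is `^\s*$`, with a backslash before the `s`. Without the backslash, `^s*$` matches the empty string and runs of the letter "s", and it misses the single-space image placeholder. A small check with the standard `re` module, using made-up sample strings:

```python
import re

blank = re.compile(r'^\s*$')     # empty or whitespace-only strings
letter_s = re.compile(r'^s*$')   # empty strings or runs of the letter "s"

samples = ['', ' ', '   ', 'sss', 'Mailbox']

blank_hits = [s for s in samples if blank.search(s)]
letter_hits = [s for s in samples if letter_s.search(s)]

print(blank_hits)
print(letter_hits)
```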


In [21]:
df.shape

(180, 6)

In [22]:
df.to_csv('dawn_articles.csv')

## 2 - Top Science Fiction Movies

In [23]:
webpage = requests.get('https://www.imdb.com/search/title/?genres=sci-fi&explore=title_type,genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3396781f-d87f-4fac-8694-c56ce6f490fe&pf_rd_r=8J8TDTSFDKRN3C9SVGCP&pf_rd_s=center-1&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_pr1_i_2').text

In [24]:
soup=BeautifulSoup(webpage,'html.parser')

In [25]:
movie_names = []
year = []
genre = []
rating = []

In [26]:
movies_data = soup.find_all('div', class_ = 'lister-item mode-advanced')

In [27]:
for i in movies_data:
    name = i.h3.a.text
    movie_names.append(name)

In [28]:
len(movie_names)

50

In [29]:
for i in movies_data:
    yr = i.h3.find('span', class_= 'lister-item-year text-muted unbold').text.replace('(','').replace(')','')
    year.append(yr)

In [30]:
len(year)

50
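
Stripping the parentheses still leaves values such as `2022–` for ongoing series, as the table further down shows. If a numeric year is wanted, one option is to pull out the first four-digit number. This is a sketch only: `extract_year` is a hypothetical helper, and the `(I) (2019)` and `TBA` samples are assumed formats rather than values taken from this page.

```python
import re

def extract_year(text):
    """Return the first four-digit number in text as an int, or None if there is none."""
    match = re.search(r'\d{4}', text)
    return int(match.group()) if match else None

samples = ['(2022)', '2022–', '(I) (2019)', 'TBA']
years = [extract_year(s) for s in samples]
print(years)
```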

In [31]:
for i in movies_data:
    gn = i.p.find('span', class_ = 'genre').text.replace('\n','')
    genre.append(gn)

In [32]:
len(genre)

50

In [33]:
for i in movies_data:
    rt = i.find('div', class_= 'inline-block ratings-imdb-rating').text.replace('\n','') if i.find('div', class_= 'inline-block ratings-imdb-rating') else '^^'
    rating.append(rt)

In [34]:
len(rating)

50

In [35]:
movies_df = pd.DataFrame({'Name of Movie':movie_names,'Year of Release':year,'Genre':genre,'Rating':rating})

In [36]:
movies_df.head()

Unnamed: 0,Name of Movie,Year of Release,Genre,Rating
0,Black Panther: Wakanda Forever,2022,"Action, Adventure, Drama",7.3
1,Andor,2022–,"Action, Adventure, Drama",8.3
2,The Peripheral,2022–,"Drama, Mystery, Sci-Fi",8.2
3,Manifest,2018–,"Drama, Mystery, Sci-Fi",7.1
4,Black Adam,2022,"Action, Adventure, Fantasy",6.9


In [37]:
movies_df.to_csv('top_Sci_movies.csv')

## 3 - Web Scraping From Multiple Webpages

In [38]:
import numpy as np
from time import sleep
from random import randint

In [39]:
movie_names = []
year = []
genre = []
rating = []

In [40]:
pages = np.arange(1,1000,50)
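
`np.arange(1,1000,50)` yields the 20 start offsets 1, 51, ..., 951, one per results page of 50 titles. The sketch below builds the matching URLs with the standard library instead of string concatenation; `page_url` is a hypothetical helper, and the query parameters are copied from the URL used in the loop that follows.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.imdb.com/search/title/'

def page_url(start):
    """Build the search URL for the results page beginning at offset `start`."""
    params = {
        'groups': 'top_1000',
        'sort': 'user_rating,desc',
        'start': start,
        'ref_': 'adv_nxt',
    }
    # safe=',' keeps the comma in the sort value unescaped
    return BASE_URL + '?' + urlencode(params, safe=',')

starts = list(range(1, 1000, 50))  # same values as np.arange(1, 1000, 50)
urls = [page_url(s) for s in starts]

print(len(urls), starts[0], starts[-1])
print(urls[0])
```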

In [41]:
for start in pages:
    # keep the loop variable separate from the HTTP response it is used to fetch
    response = requests.get('https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&start='+str(start)+'&ref_=adv_nxt')
    soup = BeautifulSoup(response.text, 'html.parser')
    movie_data = soup.find_all('div', class_ = 'lister-item mode-advanced')
    sleep(randint(2,8))  # pause 2 to 8 seconds between requests
    for i in movie_data:
        name = i.h3.a.text
        movie_names.append(name)
        
        yr = i.h3.find('span', class_= 'lister-item-year text-muted unbold').text.replace('(','').replace(')','')
        year.append(yr)
        
        gn = i.p.find('span', class_ = 'genre').text.replace('\n','')
        genre.append(gn)
        
        rt = i.find('div', class_= 'inline-block ratings-imdb-rating').text.replace('\n','') if i.find('div', class_= 'inline-block ratings-imdb-rating') else '^^'
        rating.append(rt)
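
The loop above fetches, parses and sleeps in one place. A possible variation, sketched here under assumptions, separates the fetching step and fails early on HTTP errors. `fetch_all` is a hypothetical helper; it expects any object with a requests-style `get(url, timeout=...)` method, such as a `requests.Session`.

```python
import time

def fetch_all(session, urls, delay_seconds=2.0, timeout=10):
    """Fetch each URL in turn, pausing between requests; return a list of HTML strings."""
    pages_html = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the first request
            time.sleep(delay_seconds)
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
        pages_html.append(response.text)
    return pages_html
```

With requests installed, a call would look like `fetch_all(requests.Session(), urls)`, after which each returned string can be passed to `BeautifulSoup(..., 'html.parser')` as before.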

In [42]:
movies_df = pd.DataFrame({'Name of Movie':movie_names,'Year of Release':year,'Genre':genre,'Rating':rating})

In [43]:
movies_df.head()

Unnamed: 0,Name of Movie,Year of Release,Genre,Rating
0,The Shawshank Redemption,1994,Drama,9.3
1,The Godfather,1972,"Crime, Drama",9.2
2,Kantara,2022,"Action, Adventure, Drama",9.0
3,The Dark Knight,2008,"Action, Crime, Drama",9.0
4,The Lord of the Rings: The Return of the King,2003,"Action, Adventure, Drama",9.0


In [44]:
movies_df.to_csv('top_1000_movies.csv')

## 4 - Using API

In [45]:
response = requests.get('https://api.themoviedb.org/3/movie/top_rated?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US&page=1')

In [46]:
response.json()['results']

[{'adult': False,
  'backdrop_path': '/rl7Jw8PjhSIjArOlDNv0JQPL1ZV.jpg',
  'genre_ids': [10749, 18],
  'id': 851644,
  'original_language': 'ko',
  'original_title': '20 Century Girl',
  'overview': "Yeon-du asks her best friend Bora to collect all the information she can about Baek Hyun-jin while she is away in the U.S. for heart surgery. Bora decides to get close to Baek's best friend, Pung Woon-ho first. However, Bora's clumsy plan unfolds in an unexpected direction. In 1999, a year before the new century, Bora, who turns seventeen, falls into the fever of first love.",
  'popularity': 341.026,
  'poster_path': '/od22ftNnyag0TTxcnJhlsu3aLoU.jpg',
  'release_date': '2022-10-06',
  'title': '20th Century Girl',
  'video': False,
  'vote_average': 8.8,
  'vote_count': 248},
 {'adult': False,
  'backdrop_path': '/rSPw7tgCH9c6NqICZef4kZjFOQ5.jpg',
  'genre_ids': [18, 80],
  'id': 238,
  'original_language': 'en',
  'original_title': 'The Godfather',
  'overview': 'Spanning the years 1945

In [47]:
pd.DataFrame(response.json()['results']).head()

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/rl7Jw8PjhSIjArOlDNv0JQPL1ZV.jpg,"[10749, 18]",851644,ko,20 Century Girl,Yeon-du asks her best friend Bora to collect a...,341.026,/od22ftNnyag0TTxcnJhlsu3aLoU.jpg,2022-10-06,20th Century Girl,False,8.8,248
1,False,/rSPw7tgCH9c6NqICZef4kZjFOQ5.jpg,"[18, 80]",238,en,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",91.365,/3bhkrj58Vtu7enYsRolD1fZdja1.jpg,1972-03-14,The Godfather,False,8.7,16917
2,False,/kXfqcdQKsToO0OUXHcrrNCHDBzO.jpg,"[18, 80]",278,en,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,65.133,/q6y0Go1tsGEsmtFryDOJo3dEmqu.jpg,1994-09-23,The Shawshank Redemption,False,8.7,22668
3,False,/poec6RqOKY9iSiIUmfyfPfiLtvB.jpg,"[18, 80]",240,en,The Godfather Part II,In the continuing saga of the Corleone crime f...,50.839,/hek3koDUyRQk7FIhPXsa6mT2Zc3.jpg,1974-12-20,The Godfather Part II,False,8.6,10256
4,False,/aVFx1VtlOxR3v0ADEatalXOvwbu.jpg,"[16, 14, 28]",620249,zh,罗小黑战记,"In the bustling human world, spirits live peac...",15.855,/aLv87NgRJUPkQ6sVLP72IisDdt4.jpg,2019-08-27,The Legend of Hei,False,8.6,208
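
Each entry in `results` is a plain dictionary, and `genre_ids` holds a list, which is awkward to store in a CSV cell. A minimal sketch of picking a few fields and flattening the list follows. The two sample entries are trimmed from the response shown above, and `flatten_movie` and the `|` separator are illustrative choices, not part of the API.

```python
# Two entries trimmed from the API response shown above
results = [
    {'id': 851644, 'title': '20th Century Girl', 'genre_ids': [10749, 18],
     'release_date': '2022-10-06', 'vote_average': 8.8},
    {'id': 238, 'title': 'The Godfather', 'genre_ids': [18, 80],
     'release_date': '1972-03-14', 'vote_average': 8.7},
]

def flatten_movie(movie, fields=('id', 'title', 'release_date', 'vote_average')):
    """Keep the selected fields and join genre_ids into one '|'-separated string."""
    row = {name: movie.get(name) for name in fields}
    row['genre_ids'] = '|'.join(str(g) for g in movie.get('genre_ids', []))
    return row

rows = [flatten_movie(m) for m in results]
print(rows[0])
```

The resulting list can be handed to `pd.DataFrame(rows)` in the same way as the raw `results` list.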


In [48]:
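# save the API results; `df` still holds the Dawn articles from section 1
pd.DataFrame(response.json()['results']).to_csv('movies.csv')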