<a href="https://colab.research.google.com/github/AmirAflak/Filimo-movies-analysis/blob/main/filimo_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Scraping Filimo website content with:
*   Requests
*   BeautifulSoup4
*   Selenium
## Goals:
#### Crawl useful information about each movie on [filimo](https://www.filimo.com/movies/1/iran).
#### Structure the content as a pandas DataFrame.
#### Save it as a CSV file.
#### Extract some useful insights about the movies.
## **Notice**:
#### We crawl only 400 records (movies).
#### All movies are Iranian (series are not included).
#### The analysis is in a separate notebook.
## So... let's dig into it :)

In [1]:
# import the necessary libraries:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver

In [None]:
'''
Open the Filimo listing in a browser and grab its content.
The page does not show all 400 movies at once;
a 'load-more' button loads the next batch each time it is pressed.
So the Selenium driver keeps the page open for 50 seconds,
and during that time we press 'load-more' several times.
After that, the page source holds all the loaded movies
and we can parse it into a bs4 (BeautifulSoup) object.
'''
driver = webdriver.Chrome()
url = "https://www.filimo.com/movies/1/iran"
driver.get(url)
# keep the page open for 50 seconds so 'load-more' can be pressed
time.sleep(50)
# BeautifulSoup is imported under the alias `bs`
soup = bs(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())


<!DOCTYPE html>
<html class="not-TV is-fa" lang="fa">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, user-scalable=0" name="viewport"/>
  <meta content="ie=edge" http-equiv="X-UA-Compatible"/>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <link href="/assets/web/ui/img-WeHSW4U3lW1X8ksQFIDvA/filimo/favicon.ico" rel="icon"/>
  <meta content="PC18uVI6U1HMyblxoEkQvpSr3ILr1oswK0gqsWM6WW0=" name="verify-v1"/>
  <meta content="Qcb_HaFyYf0ogZVoQsp_Pd1D1ctl5NWBl-_f80XTaGc" name="google-site-verification"/>
  <meta content="تماشای آنلاین فیلم و سریال | فیلیمو" property="og:title"/>
  <meta content="https://www.filimo.com/assets/web/ui/img-WeHSW4U3lW1X8ksQFIDvA/filimo/filimo_200.png" property="og:image"/>
  <meta content="https://www.filimo.com" property="og:url"/>
  <meta content="تماشای آنلاین فیلم و سریال  با فیلیمو | آرشیو بیش از ۴۵۰۰۰ هزار عنوان ایرانی و خارجی  به همراه دوبله فارسی و زیرنویس چسبیده | با قابلیت دانلود با کمک اپلیکیشن فی
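
The fixed 50-second wait described above could also be replaced by a loop that clicks 'load-more' until the button is gone. The following is only a minimal sketch of that idea: `click_until_gone` and its `find_button` argument are hypothetical names, and the helper takes a callable so that it does not assume anything about Filimo's actual markup, which this notebook does not show.

```python
import time


def click_until_gone(find_button, max_clicks=50, pause=2.0):
    """Click a 'load-more' style button until it disappears.

    find_button: hypothetical callable that returns a clickable object
                 (anything with a .click() method) or None when no
                 button is left on the page.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break  # nothing left to load
        button.click()
        clicks += 1
        time.sleep(pause)  # give the page time to load the next batch
    return clicks
```

With Selenium, `find_button` could wrap a call such as `driver.find_elements(By.CSS_SELECTOR, ...)` and return the first match or `None`; the selector itself would have to be read from the live page.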

In [None]:
# the fields we want to crawl from each item (one list per field):
title = []
genre = []
production = []
actor = []
producer = []
rate = []
rate_percent = []
imdb  = []

In [None]:
'''
First we need to find the tag that wraps each movie,
a tag that contains all of the information we need.
It turns out that every movie item sits in a div tag with the class below:
'''
items = soup.find_all('div',
                      attrs = {'class' : 'ds-movie-item ui-mb-4x ui-pt-2x'})
'''
Now we have the parent tag of each movie,
which contains all of the information we need.
'''
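
One detail worth knowing about the lookup above: when the class value contains spaces, BeautifulSoup compares it with the whole `class` attribute string, so a different class order or an extra class means no match. A CSS selector on a single class is more tolerant. The sketch below illustrates the difference on made-up markup (not copied from filimo.com); whether `ds-movie-item` alone is specific enough on the real page is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div class="ds-movie-item ui-mb-4x ui-pt-2x">A</div>
<div class="ui-pt-2x ds-movie-item ui-mb-4x">B</div>
<div class="ds-movie-item ui-mb-4x ui-pt-2x extra">C</div>
"""
demo = BeautifulSoup(html, "html.parser")

# Exact class-string match: only the div whose attribute is identical.
exact = demo.find_all("div", attrs={"class": "ds-movie-item ui-mb-4x ui-pt-2x"})

# CSS selector: every div that carries the ds-movie-item class.
by_css = demo.select("div.ds-movie-item")

exact_texts = [d.get_text() for d in exact]
css_texts = [d.get_text() for d in by_css]
```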

In [None]:
'''
Do all of the crawling inside one function,
called once for each parent div tag (movie) in items:
'''
def fill_fields(item):
    # grab the title of the movie:
    try:
        title.append(item.find('div',
                        attrs={'class': 'small-font truncate ui-pt-4x list_title'}).a.text)
    # if the title is missing, fill in None instead:
    except Exception:
        title.append(None)
    # grab the genres of the movie:
    try:
        genre.append(item.find('p',
                        attrs={'class': 'ds-thumb-info ui-mt-2x'}).text)
    # if the genre is missing, fill in None instead:
    except Exception:
        genre.append(None)
    # grab the production year: second 'ds-thumb-info' paragraph, the part before '-':
    try:
        production.append(item.find_all('p',
                        attrs={'class': 'ds-thumb-info ui-mt-2x'})[1].text.split('-')[0].strip())
    # if the production year is missing, fill in None instead:
    except Exception:
        production.append(None)
    # grab the IMDb rating: second 'ds-badge_label' span, the part before '/':
    try:
        imdb.append(item.find_all('span',
                        attrs={'class': 'ds-badge_label'})[1].get_text().split('/')[0])
    # if the IMDb rating is missing, fill in None instead:
    except Exception:
        imdb.append(None)

    # grab the (Filimo) rate percent: first 'ds-badge_label' span:
    try:
        rate_percent.append(item.find('span',
                        attrs={'class': 'ds-badge_label'}).get_text(strip=True))
    # if the rate percent is missing, fill in None instead:
    except Exception:
        rate_percent.append(None)
    '''
    Some information (producer, actors and rate)
    is only available on the movie's own page.
    To crawl it, we take the anchor link of each movie
    and request that page, so there will be 400 requests:
    '''
    try:
        item_response = requests.get(item.find('a',
                        attrs={'class': 'overlay--transparent'})['href'], timeout=30)
        # name the parser explicitly, as was done for the listing page
        item_markup = bs(item_response.content, 'html.parser')
    # if the page cannot be fetched, the three lookups below fall back to None,
    # so all eight lists keep the same length:
    except Exception:
        item_markup = None
    # get the actors list of the movie:
    try:
        actor.append(item_markup.find('div',
                        attrs={'class': 'actors-list clearfix'}).get_text(',', strip=True).split(','))
    # if there are no actors, fill in None instead:
    except Exception:
        actor.append(None)
    # get the producer of the movie:
    try:
        producer.append(item_markup.find('li',
                        attrs={'class': 'crew-names'}).get_text(strip=True))
    # if there is no producer, fill in None instead:
    except Exception:
        producer.append(None)
    # get the rate of the movie:
    try:
        rate.append(item_markup.find('span',
                        attrs={'id': 'rateCnt'}).get_text(strip=True))
    # if there is no rate, fill in None instead:
    except Exception:
        rate.append(None)
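
The function above relies on eight parallel lists staying the same length. An alternative is to build one dictionary per movie, so a missing field can never shift a column. This is only a sketch of that design, not the notebook's method: `safe`, `parse_year` and the sample values (including the `'1398 - 90 min'` format) are hypothetical stand-ins for the real bs4 lookups.

```python
def safe(getter, default=None):
    """Run a zero-argument extractor and return default if it fails.

    Hypothetical helper that plays the role of the repeated
    try/except blocks in fill_fields.
    """
    try:
        return getter()
    except (AttributeError, IndexError, KeyError, TypeError):
        return default


def parse_year(raw):
    """Mirror the production-year step: text before the first '-', stripped."""
    return raw.split('-')[0].strip()


# Stand-in for scraped values; a real run would read them from bs4 tags.
raw_items = [
    {'title': 'Movie A', 'info': ['Drama', '1398 - 90 min']},
    {'title': 'Movie B', 'info': ['Comedy']},  # production line missing
]

rows = []
for raw in raw_items:
    # Each lambda is called immediately, inside the same iteration.
    rows.append({
        'Title': safe(lambda: raw['title']),
        'Genre': safe(lambda: raw['info'][0]),
        'Production': safe(lambda: parse_year(raw['info'][1])),
    })
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, which gives the same kind of table as the dictionary of lists used later in this notebook.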

In [None]:
# crawl the information from each div tag (movie):
for item in items : 
    fill_fields(item)
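
Because this loop triggers one request per movie, it may be worth pausing between items and keeping track of any that fail. The helper below is a hedged sketch of that idea: `crawl_politely` is a hypothetical name, the one-second default pause is an arbitrary choice, and it assumes the handler either records a complete row or raises before recording anything.

```python
import time


def crawl_politely(items, handle, pause=1.0):
    """Call handle(item) for every item, sleeping between calls.

    Returns the items whose handler raised, so they can be retried later.
    """
    failed = []
    for index, item in enumerate(items):
        try:
            handle(item)
        except Exception:
            failed.append(item)
        # no need to sleep after the last item
        if index < len(items) - 1:
            time.sleep(pause)
    return failed
```

In this notebook it could be called as `crawl_politely(items, fill_fields)`.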

In [None]:
# build a dictionary that maps each column name to its list of values:
data = {
    'Title' : title,
    'Genre' : genre,
    'Production' : production,
    'Actor' : actor,
    'Producer' : producer,
    'Rate' : rate,
    'Rate_percent' : rate_percent,
    'Imdb' : imdb
}

In [None]:
# convert 'data' dictionary to pandas dataframe
df = pd.DataFrame(data)

In [None]:
# save dataframe as a csv file in local path
df.to_csv('filimo.csv')
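
One thing to keep in mind when the CSV is read back: the `Actor` column holds Python lists, and CSV stores them as plain text such as `"['a', 'b']"`. The sketch below shows one possible way to restore them with `ast.literal_eval`; the data is made up, `parse_list` is a hypothetical helper, and `index=False` is used here only to keep the demo small (the cell above writes the index as well).

```python
import ast
import io

import pandas as pd

# Made-up rows shaped like the scraped data.
demo_df = pd.DataFrame({
    'Title': ['Movie A', 'Movie B'],
    'Actor': [['Actor One', 'Actor Two'], None],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)


def parse_list(cell):
    """Turn "['a', 'b']" back into a list; empty cells become None."""
    if isinstance(cell, str) and cell.strip():
        return ast.literal_eval(cell)
    return None


restored = pd.read_csv(buffer, converters={'Actor': parse_list})
```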

## So... now all the content we need is structured as a CSV file.
## To see the DataFrame and extract some valuable insights,
## open the separate notebook, which is in the same directory.
## Thanks for checking :)