# What's on TV Tonight?

This notebook covers every step of the project:
- scrape the French TV guide and collect the desired information:
    - channel number and channel name
    - programme title
    - start time and duration
    - rating out of 5 given by "TéléLoisirs"
    - synopsis of the programme
    - year of release
- display programmes by type (the user chooses which type to watch), sorted by rating for movies

The code also works with what is on TV right now (the "En ce moment" table on the main web page) and with other TV guides (Orange, SFR, Bouygues, etc.).

## Libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup 

## URL links & Constants

In [2]:
URL_TNT = "https://www.programme-tv.net/"
URL_CANAL = "https://www.programme-tv.net/programme/canal-plus/"
URL_NOW = "https://www.programme-tv.net/programme/en-ce-moment.html"
URL_TOMORROW = "https://www.programme-tv.net/programme/toutes-les-chaines/2020-12-23/"  # the date must be updated every day

CHANNEL_NAME = "channel_name"
CHANNEL_NUMBER = "channel_number"
BEGINNING_HOUR = "beginning_hour"
PROGRAMME_TITLE = "programme_title"
PROGRAMME_TYPE = "programme_type" 
PROGRAMME_DURATION = "programme_duration"
SYNOPSIS = "synopsis"
RATE = "rate /5"
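
The date in `URL_TOMORROW` is hard-coded and has to be edited by hand. A minimal sketch of one way to compute it instead, assuming the site keeps the `toutes-les-chaines/YYYY-MM-DD/` path pattern seen in the constant above. `BASE_ALL_CHANNELS` and `url_for_date` are hypothetical names, not part of the notebook.

```python
from datetime import date, timedelta

# Assumption: path pattern copied from the hard-coded URL_TOMORROW constant.
BASE_ALL_CHANNELS = "https://www.programme-tv.net/programme/toutes-les-chaines/"


def url_for_date(day: date) -> str:
    """Return the all-channels guide URL for the given day."""
    return f"{BASE_ALL_CHANNELS}{day.isoformat()}/"


# Tomorrow's URL, computed at run time instead of edited by hand
url_tomorrow = url_for_date(date.today() + timedelta(days=1))
```

With this helper, `url_for_date(date(2020, 12, 23))` reproduces the value currently stored in `URL_TOMORROW`.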

## Data retrieval

###### Get data for TNT (Télévision Numérique Terrestre) 

In [3]:
def loader_tnt(url_link: str) -> pd.DataFrame:
    
    # Set up scraping tool
    url_basic = url_link
    response_basic = requests.get(url_basic)
    soup = BeautifulSoup(response_basic.text, 'html.parser')
    info_box = soup.find_all("div", class_="doubleBroadcastCard")
    tv_programme = []
    
    for i in range(len(info_box)):
        # Get the basic information available on the main page
        channel_name    = info_box[i].find("a", class_="doubleBroadcastCard-channelName").text.strip()
        channel_number  = getattr(info_box[i].find("div", class_="doubleBroadcastCard-channelNumber"), 
                                  'text', None)
        beginning_hour  = info_box[i].find("div", class_="doubleBroadcastCard-hour").text.strip()
        programme_title = info_box[i].find("a", class_="doubleBroadcastCard-title").text.strip()
        programme_type  = info_box[i].find("div", class_="doubleBroadcastCard-type").text.strip()
        prgrm_duration  = info_box[i].find("div", class_="doubleBroadcastCard-durationContent").text.strip()
        url_link        = info_box[i].find("div", class_="doubleBroadcastCard-infos").a['href']

        # Request the programme's own page to get the extra details (synopsis / opinion) ...
        url_synopsis = url_link
        response_url_synopsis = requests.get(url_synopsis)
        soup_synopsis = BeautifulSoup(response_url_synopsis.text, 'html.parser')

        # ... whose HTML structure differs by programme type (movies, sport, series)
        movies_box = soup_synopsis.find_all("div", class_="programHome-mainContent")
        sport_box = soup_synopsis.find_all("div", class_="matchDetails")
        serie_box = soup_synopsis.find_all("div", class_="programCollectionEpisode-mainContent")
        header_box = soup_synopsis.find_all("section", class_="programHome-overview")

        # Build three synopsis variables: each is filled when its box is found and stays blank otherwise
        # Get the film synopsis or the summary (mostly for sports, culture programmes, series & others)
        try:
            synopsis_film = (movies_box[0].find("div", class_="synopsis-text defaultStyleContentTags")
                            ).text.strip()
        except (IndexError, AttributeError):  # box missing, or box present without a synopsis
            synopsis_film = " "

        # Sports synopsis
        try:
            sport_details = (sport_box[0].find("p", class_="matchDetails-synopsis defaultStyleContentTags")
                            ).text.strip()
        except (IndexError, AttributeError):
            sport_details = " "

        # Series synopsis
        try:
            synopsis_serie = (serie_box[0].find("div", class_="synopsis-text defaultStyleContentTags")
                             ).text.strip()
        except (IndexError, AttributeError):
            synopsis_serie = " "

        # Get the "TéléLoisirs" opinion as a string
        try:
            opinion = getattr(movies_box[0].find("div", class_="review-loveText"), 'text', str(None))
        except IndexError:
            # "None" is replaced by 0 further down: a movie may have no rate, and an
            # empty string would make it impossible to sort by RATE
            opinion = "None"

        # Get the year of release
        try:
            genre = getattr(header_box[0].find("div", class_="overview-overviewSubtitle"), 'text', str(None))
            # Keep the first all-digit token and wrap it in parentheses: " (2020)"
            find_integer = " (" + str([int(s) for s in genre.split() if s.isdigit()][0]) + ")"
        except IndexError:
            find_integer = " "

        info = [channel_name, programme_title, programme_type, channel_number, beginning_hour, prgrm_duration, 
                opinion, synopsis_film, sport_details, synopsis_serie, find_integer]
        
        tv_programme.append(info)
       
    # Build a DataFrame from scraped data
    tv_programme = pd.DataFrame(tv_programme, columns=[CHANNEL_NAME, PROGRAMME_TITLE, PROGRAMME_TYPE, 
                                                       CHANNEL_NUMBER, BEGINNING_HOUR, PROGRAMME_DURATION, 
                                                       RATE, "synopsis_film", "sport_details", 
                                                       "synopsis_serie", "year"])
    
    # Group all synopsis type (movies, sport, series) into one unique "synopsis" variable    
    tv_programme[SYNOPSIS] = (tv_programme["synopsis_film"] + tv_programme["sport_details"] 
                              + tv_programme["synopsis_serie"])
        
    # Replace the opinion label by an integer representing the programme's rate out of 5.
    # The numeric rate could not be scraped directly because the site renders it as an <svg ...> element.
    # /!\ 0 means NO RATE, not that the programme is very bad; unknown labels also map to 0
    rate_labels = {"À ne pas manquer": 5, "Très bon": 4, "Bon": 3, "Assez bon": 2, "Décevant": 1}
    tv_programme[RATE] = tv_programme[RATE].str.strip().map(rate_labels).fillna(0).astype(int)

    # Concatenate Title & Year of release
    tv_programme[PROGRAMME_TITLE] = tv_programme[PROGRAMME_TITLE] + tv_programme["year"]
    
    # Drop intermediate columns    
    tv_programme.drop(columns=["synopsis_film", "sport_details", "synopsis_serie", "year"], axis=1, 
                      inplace=True)
    return tv_programme
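
The rate and year-of-release steps inside `loader_tnt` can also be expressed as two small pure functions, which makes them testable without hitting the website. This is only a sketch: `RATE_LABELS`, `rate_from_label` and `year_suffix` are hypothetical names, and the year helper swaps the token-based `isdigit()` check for a regular expression that looks for the first standalone four-digit number and returns an empty string (rather than a single space) when none is found.

```python
import re
from typing import Optional

# Review labels used by the loader, without the surrounding whitespace
RATE_LABELS = {
    "À ne pas manquer": 5,
    "Très bon": 4,
    "Bon": 3,
    "Assez bon": 2,
    "Décevant": 1,
}


def rate_from_label(label: Optional[str]) -> int:
    """Map a review label to a rate out of 5; 0 means 'no rate'."""
    if label is None:
        return 0
    return RATE_LABELS.get(label.strip(), 0)


def year_suffix(subtitle: Optional[str]) -> str:
    """Return ' (YYYY)' for the first standalone 4-digit number, else ''."""
    match = re.search(r"\b(\d{4})\b", subtitle or "")
    return f" ({match.group(1)})" if match else ""
```

Stripping the label before the lookup means the mapping no longer depends on the exact indentation of the scraped HTML.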

In [4]:
programme_tnt = loader_tnt(URL_TNT)
# programme_tnt

###### Get data for "Le bouquet de Canal" (pay-TV channels; not needed if you are not a subscriber)

In [5]:
# Same function as loader_tnt, with one additional step: channel numbers are entered by hand after the DataFrame is built
def loader_canal(url_link: str) -> pd.DataFrame:
    
    url_basic = url_link
    response_basic = requests.get(url_basic)
    soup = BeautifulSoup(response_basic.text, 'html.parser')
    info_box = soup.find_all("div", class_="doubleBroadcastCard")
    tv_programme = []
    
    for i in range(len(info_box)):
        channel_name    = info_box[i].find("a", class_="doubleBroadcastCard-channelName").text.strip()
        channel_number  = ""
        beginning_hour  = info_box[i].find("div", class_="doubleBroadcastCard-hour").text.strip()
        programme_title = info_box[i].find("a", class_="doubleBroadcastCard-title").text.strip()
        programme_type  = info_box[i].find("div", class_="doubleBroadcastCard-type").text.strip()
        prgrm_duration  = info_box[i].find("div", class_="doubleBroadcastCard-durationContent").text.strip()
        url_link        = info_box[i].find("div", class_="doubleBroadcastCard-infos").a['href']
        
        url_synopsis = url_link
        response_url_synopsis = requests.get(url_synopsis)
        soup_synopsis = BeautifulSoup(response_url_synopsis.text, 'html.parser')
        
        movies_box = soup_synopsis.find_all("div", class_="programHome-mainContent")
        sport_box = soup_synopsis.find_all("div", class_="matchDetails")
        serie_box = soup_synopsis.find_all("div", class_="programCollectionEpisode-mainContent")
        header_box = soup_synopsis.find_all("section", class_="programHome-overview")
        
        try:
            synopsis_film = (movies_box[0].find("div", class_="synopsis-text defaultStyleContentTags")
                            ).text.strip()
        except (IndexError, AttributeError):  # box missing, or box present without a synopsis
            synopsis_film = " "

        try:
            sport_details = (sport_box[0].find("p", class_="matchDetails-synopsis defaultStyleContentTags")
                            ).text.strip()
        except (IndexError, AttributeError):
            sport_details = " "

        try:
            synopsis_serie = (serie_box[0].find("div", class_="synopsis-text defaultStyleContentTags")
                             ).text.strip()
        except (IndexError, AttributeError):
            synopsis_serie = " "
        
        try:
            opinion = getattr(movies_box[0].find("div", class_="review-loveText"), 'text', str(None))
        except IndexError:
            opinion = "None"
        
        try:
            genre = getattr(header_box[0].find("div", class_="overview-overviewSubtitle"), 'text', str(None))
            find_integer = str(" (") + str([int(s) for s in genre.split() if s.isdigit()][0]) + str(")")
        except IndexError:
            find_integer = " "

        info = [channel_name, programme_title, programme_type, channel_number, beginning_hour, prgrm_duration, 
                opinion, synopsis_film, sport_details, synopsis_serie, find_integer]
        
        tv_programme.append(info)
       
    tv_programme = pd.DataFrame(tv_programme, columns=[CHANNEL_NAME, PROGRAMME_TITLE, PROGRAMME_TYPE, 
                                                       CHANNEL_NUMBER, BEGINNING_HOUR, PROGRAMME_DURATION, 
                                                       RATE, "synopsis_film", "sport_details", 
                                                       "synopsis_serie", "year"])
    
    tv_programme[SYNOPSIS] = (tv_programme["synopsis_film"] + tv_programme["sport_details"] 
                              + tv_programme["synopsis_serie"])
    
    # The website gives no channel number for the Canal channels, so the values are entered by hand
    # (they match my own setup; the list assumes the page returns exactly these six channels, in this order)
    tv_programme[CHANNEL_NUMBER] = ["Chaîne n°4", "Chaîne n°40", "Chaîne n°41", "Chaîne n°42", 
                                    "Chaîne n°43", "Chaîne n°44"]
    
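    # Map the opinion label to a rate out of 5; 0 means no rate (unknown labels also map to 0)
    rate_labels = {"À ne pas manquer": 5, "Très bon": 4, "Bon": 3, "Assez bon": 2, "Décevant": 1}
    tv_programme[RATE] = tv_programme[RATE].str.strip().map(rate_labels).fillna(0).astype(int)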

    tv_programme[PROGRAMME_TITLE] = tv_programme[PROGRAMME_TITLE] + tv_programme["year"]
    
    tv_programme.drop(columns=["synopsis_film", "sport_details", "synopsis_serie", "year"], axis=1, 
                      inplace=True)
        
    return tv_programme
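
Assigning a fixed six-item list to the channel-number column only works while the page returns exactly six rows in the expected order; otherwise pandas raises a length-mismatch error. A hedged alternative is to look the number up by channel name. In the sketch below, `CANAL_NUMBERS` and `channel_number_for` are hypothetical names, and only the two pairs that appear in this notebook's output are filled in; the remaining channels would have to be added by hand.

```python
# Only these two name/number pairs are confirmed by the notebook's output;
# the other Canal channels must be added for your own provider.
CANAL_NUMBERS = {
    "Canal+ Cinéma": "Chaîne n°40",
    "Canal+ Family": "Chaîne n°44",
}


def channel_number_for(channel_name: str) -> str:
    """Return the hand-entered channel number, or '' if the channel is unknown."""
    return CANAL_NUMBERS.get(channel_name.strip(), "")
```

Inside the loader this could replace the list assignment with `tv_programme[CHANNEL_NUMBER] = tv_programme[CHANNEL_NAME].map(channel_number_for)`, which stays valid whatever the number or order of rows.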

In [6]:
programme_canal = loader_canal(URL_CANAL)
# programme_canal

###### Concatenate the 2 dataframes (TNT + "Le bouquet de Canal")

In [7]:
def concatenate_frames(df_tnt: pd.DataFrame, df_canal: pd.DataFrame) -> pd.DataFrame:
    return pd.concat([df_tnt, df_canal]).drop_duplicates(subset=CHANNEL_NAME)

In [8]:
programme_tv = concatenate_frames(programme_tnt, programme_canal)
# programme_tv.head()

## Display the best programmes by category

###### List of today's programme types: choose what you want to watch tonight

In [9]:
programme_type_of_the_day = programme_tv[PROGRAMME_TYPE].unique().tolist()
programme_type_of_the_day

['Autre', 'Téléfilm', 'Sport', 'Culture Infos', 'Série TV', 'Cinéma']

###### Select a dataframe (TNT, Le bouquet de Canal or both) and the programme type you want to watch

In [10]:
def choose_programme_type(df: pd.DataFrame, programme_type: str) -> pd.DataFrame:
    # None disables truncation so the full synopsis is displayed
    pd.set_option("display.max_colwidth", None)
    df_chosen = df.loc[df[PROGRAMME_TYPE] == programme_type]
    df_chosen = df_chosen.drop(columns=[PROGRAMME_TYPE]).set_index([CHANNEL_NAME, CHANNEL_NUMBER])
    return df_chosen.sort_values(by=RATE, ascending=False)  # .head(3)

In [11]:
selected_type = choose_programme_type(programme_tv, "Cinéma")
selected_type

| channel_name | channel_number | programme_title | beginning_hour | programme_duration | rate /5 | synopsis |
|---|---|---|---|---|---|---|
| Canal+ Cinéma | Chaîne n°40 | Seules les bêtes (2019) | 20h52 | 1h53min | 3 | Dans les Causses, une femme disparaît durant une nuit neigeuse. Femme d'agriculteur, Alice aperçoit le véhicule un jour, en rentrant de chez Joseph, son amant. Les gendarmes ne parviennent pas à retrouver le corps de la disparue. Intriguée, Alice retourne chez Joseph. Ce dernier lui explique que quelqu'un a tué son chien et la repousse avec brusquerie. Dans le même temps, Michel, le mari d'Alice, passe beaucoup de temps sur internet, à converser avec une belle jeune femme. Il ne se doute pas qu'il s'agit en fait d'une arnaque orchestrée de Côte d'Ivoire. |
| Canal+ Family | Chaîne n°44 | Hulk (2003) | 20h52 | 2h13min | 3 | Alors qu'il effectue une expérience dans son laboratoire de San Francisco, le généticien Bruce Banner se retrouve exposé par inadvertance à une surdose de rayons gamma. Au grand soulagement de sa collaboratrice Betty Ross, Bruce en sort indemne. Mais peu après, il constate que quelque chose a changé en lui : à chaque émotion forte, il se transforme en un monstre incontrôlable, à la force surhumaine. |


### A few hints for the webapp

[Advanced callbacks](https://dash.plotly.com/advanced-callbacks)

[Return a dataframe as a data_table from a callback](https://stackoverflow.com/questions/55269763/return-a-pandas-dataframe-as-a-data-table-from-a-callback-with-plotly-dash-for-p/55305812#55305812)

[Return df with dash](https://dash.plotly.com/datatable/editable)

In [17]:
choices = {
    "tonight":["https://www.programme-tv.net/", "https://www.programme-tv.net/programme/canal-plus/"],
    "now"    :["https://www.programme-tv.net/programme/en-ce-moment.html", 
               "https://www.programme-tv.net/programme/canal-plus/en-ce-moment.html"]
}
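
A small lookup helper can turn the `choices` dictionary into the pair of URLs the two loaders expect. This is a sketch only: `urls_for` is a hypothetical name, and the dictionary is repeated so the block runs on its own.

```python
from typing import Dict, List, Tuple

choices: Dict[str, List[str]] = {
    "tonight": ["https://www.programme-tv.net/",
                "https://www.programme-tv.net/programme/canal-plus/"],
    "now": ["https://www.programme-tv.net/programme/en-ce-moment.html",
            "https://www.programme-tv.net/programme/canal-plus/en-ce-moment.html"],
}


def urls_for(moment: str) -> Tuple[str, str]:
    """Return (tnt_url, canal_url) for 'tonight' or 'now'."""
    try:
        tnt_url, canal_url = choices[moment]
    except KeyError:
        raise ValueError(
            f"unknown choice {moment!r}; expected one of {sorted(choices)}"
        ) from None
    return tnt_url, canal_url
```

In a webapp callback, the selected value could then be unpacked with `tnt_url, canal_url = urls_for("now")` and passed to `loader_tnt` and `loader_canal` before calling `concatenate_frames`.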