# Working with Data

In this notebook, we prepare the dataset that will be used to fine-tune our NLP models.  
The steps are:

- Scrape ad data from Moroccan product channels on YouTube
- Clean the data to keep only the ad content written in Moroccan dialect

## Import Libraries

In [1]:
import pandas as pd
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

## Scraping Data

In [2]:
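# define a function to scrape the data
def scrape_ads(url):
    ''' Takes the URL of a YouTube channel's videos page and returns a list of ad texts from the channel
    '''
    # Launch the Chrome driver. Selenium 4.10+ no longer accepts the driver path
    # as a positional argument, so it is passed through a Service object.
    service = webdriver.ChromeService(executable_path='../drivers/chromedriver')
    driver = webdriver.Chrome(service=service)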
    driver.get(url)

    # Scroll down to load more videos
    height = driver.execute_script("return document.documentElement.scrollHeight")
    start_time = time.time()
    while True:
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight)")
        time.sleep(1)
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_height == height:
            break
        height = new_height
        if time.time() - start_time >= 5:
            break

    # Scrape links to individual videos
    src = driver.page_source 
    soup = BeautifulSoup(src, 'lxml')
    pub = soup.find_all('a', {'id': 'video-title-link'})
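    links = []
    for link in pub:
        href = link.get('href')
        if href is not None:
            # href is a site-relative path such as '/watch?v=...', so the base
            # URL must not end with a slash (avoids 'youtube.com//watch')
            links.append("https://www.youtube.com" + href)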

    # Scrape ads from each individual video
    ads=[]
    for link in links[:-1]:
        driver.get(link)
        time.sleep(2)
        src = driver.page_source 
        soup = BeautifulSoup(src, 'lxml')
        ad = soup.find('span', {'class' : 'yt-core-attributed-string yt-core-attributed-string--white-space-pre-wrap'})
        if ad:
            ads.append(ad.text)

    # Quit the driver and return the list of ads
    driver.quit()
    return ads
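
The scroll loop inside `scrape_ads` can be tested without a browser if the two JavaScript calls are passed in as plain functions. The following is a minimal sketch under that assumption: `scroll_until_stable` and `FakePage` are hypothetical names that do not appear in the notebook, and the defaults simply mirror the 5-second budget and 1-second pause used above.

```python
import time
from typing import Callable


def scroll_until_stable(get_height: Callable[[], int],
                        scroll_down: Callable[[], None],
                        max_seconds: float = 5.0,
                        pause: float = 1.0) -> int:
    """Scroll until the page height stops growing or the time budget runs out.

    Returns the number of scrolls that were performed.
    """
    height = get_height()
    deadline = time.monotonic() + max_seconds
    scrolls = 0
    while time.monotonic() < deadline:
        scroll_down()
        scrolls += 1
        time.sleep(pause)  # give the page time to load more videos
        new_height = get_height()
        if new_height == height:
            break  # nothing new was loaded, so the bottom was reached
        height = new_height
    return scrolls


class FakePage:
    """Hypothetical stand-in for a browser page whose height grows per scroll."""

    def __init__(self, heights):
        self.heights = heights
        self.position = 0

    def height(self):
        return self.heights[self.position]

    def scroll(self):
        # Move to the next height, staying on the last one once it is reached
        self.position = min(self.position + 1, len(self.heights) - 1)


page = FakePage([100, 200, 300])
scrolls = scroll_until_stable(page.height, page.scroll, pause=0)
print(scrolls)
```

With a live driver, `get_height` and `scroll_down` would be lambdas wrapping the two `driver.execute_script` calls used in `scrape_ads`. Raising `max_seconds` lets the loop reach older videos on channels with long upload histories.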

In [None]:
# get the ads for the most famous channels
channels = ['orangemaroc', 'maroctelecom', 'BimoOfficiel', 'inwi']

# collect the ads from every channel, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
all_ads = []
for channel in channels:
    all_ads.extend(scrape_ads("https://www.youtube.com/" + channel + "/videos"))

# the column is named 'ads' to match the cleaning cells below
df = pd.DataFrame(all_ads, columns=['ads'])

In [None]:
# inspect the data
df.head()

## Data Cleaning

In [26]:
import re

def clean_text(text):
    ''' Takes raw text as input and returns cleaned text
    '''
    # Keep only characters from the Arabic Unicode block (U+0600-U+06FF) and whitespace;
    # this drops emojis, Latin letters, ASCII digits and ASCII punctuation in one pass
    text = re.sub(r'[^\u0600-\u06FF\s]+', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
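
To make the effect of the cleaning step concrete, here is a small sketch that runs the same two substitutions on a few made-up strings. The samples are invented for illustration and are not taken from the scraped data; the function is repeated so that the block runs on its own.

```python
import re


def clean_text(text):
    # Keep only Arabic-block characters (U+0600-U+06FF) and whitespace
    text = re.sub(r'[^\u0600-\u06FF\s]+', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


# Hypothetical inputs: mixed-script text, Latin-only text, Arabic with extra spaces
samples = [
    "Promo 50% 🎉 عرض جديد!!",
    "Hello world",
    "شكون   فعائلتك ؟",
]
cleaned = [clean_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two things are worth noticing: a string with no Arabic characters becomes empty, which is why blank rows show up later, and Arabic punctuation such as `؟` and `،` survives because it lies inside the same Unicode block.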

In [27]:
# apply the function to the ads column
df['ads'] = df['ads'].apply(clean_text)

In [34]:
# inspect the data
df.head()

| Unnamed: 0 | ads_clean |
|---|---|
| 0 | حصرياً و غير عند اورنج، عيش لفيبر فدارك ونتا ه... |
| 1 | تعرفوا معنا على سر الحرفة مع فاطمة، عمران و هش... |
| 2 | بفضل الخطوات ديالكم وزعنا الاأنترنت على عدة جم... |
| 3 | الساعة جديدة هادي ولا القديمة ؟ شكون فعائلتك و... |
| 4 | حيت عندنا ديما نتا لول، تبرع بالماكس ديال السخ... |
| ... | ... |
| 485 | |
| 486 | |
| 487 | |
| 488 | |


Some rows are blank after cleaning, so we remove them.

In [35]:
# remove blank ads
df = df[df['ads'] != '']
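
The comparison above only matches exact empty strings. If the dataframe has been reloaded from a CSV file, pandas reads empty cells as `NaN` by default, and those rows would pass the filter. A slightly more defensive variant is sketched below on a toy frame; `df_demo` and its contents are hypothetical, and only the column name `ads` is taken from the notebook.

```python
import pandas as pd

# Toy data: one real ad, an empty string, a whitespace-only string, a missing value
df_demo = pd.DataFrame({"ads": ["عرض جديد", "", "   ", None]})

# Treat missing values as empty and ignore surrounding whitespace before comparing
text = df_demo["ads"].fillna("").str.strip()
df_demo = df_demo[text != ""].reset_index(drop=True)

print(len(df_demo))
```

Resetting the index is optional; it just keeps the row labels contiguous after the blank rows are dropped.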

Lastly, we save the data to a CSV file.

In [37]:
df.to_csv("../data/ads.csv", index=False)
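
As a quick sanity check, the sketch below writes a toy frame to an in-memory buffer with `index=False` and reads it back. The data and the name `df_demo` are hypothetical, and a `StringIO` buffer stands in for `../data/ads.csv` so that nothing is written to disk.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"ads": ["عرض جديد", "مرحبا بكم"]})

# Write without the index, as in the cell above
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Read the CSV text back and check that only the 'ads' column is present
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```

Saving with the index included and reloading without `index_col` is the usual way an `Unnamed: 0` column ends up in a dataframe, so `index=False` avoids it on later reloads.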