### About The Data
- EDA will be conducted on the description column.
- The dataset is available at: https://www.kaggle.com/datasets/gpreda/bbc-news?select=bbc_news.csv.
- The motivation behind the dataset is to produce a machine learning model that can predict the topic of a new article.
- The data is captured from the BBC website.
- The text preprocessing step will mainly analyze the "description" column and prepare it for the ML analysis. This means we will attempt to clean the text as much as possible.
- Common text preprocessing steps include:
    - Lower casing:
        - The idea is to convert all the text into the same casing format. For example the inputs of 'hi', 'Hi', and 'HI' will all be treated the same way.
        - This is helpful for analyses that do not depend on casing. However, it can have a slight negative impact on ML models that rely on the original casing of the words, e.g., a sentiment analysis that treats a word written entirely in upper case as a sign of anger.
        - For this project we did not perform lower casing, because we are conducting a sentiment analysis.
    - Removing URLs, Emails & HTML Tags:
        - The idea is to strip URLs, email addresses, and HTML tags from the text.
        - The data did not contain any emails. Therefore, no emails were removed.
    - Punctuation Removal:
        - The idea is to remove punctuation from the text data, so that the same text is treated the same way whether or not it is punctuated. For example, "Yo" and "Yo!!!" will be treated the same way.
        - This step also includes the removal of non-alphanumeric characters.
        - The most important decision in punctuation removal is the selection of symbols to remove.
    - Stop Words Removal:
        - The idea is to remove all the words that commonly occur in the language. E.g., "a", "so", etc.
    - Frequent Words Removal:
        - The idea is to remove all the words that commonly occur in a specific type of document or text.
        - For this project we did not perform frequent words removal, because TF-IDF will take care of this step.
    - Rare Words Removal:
        - The idea is to remove all the words that barely occur in text.
        - For this project we did not perform rare words removal, because TF-IDF will take care of this step.
    - Stemming:
        - The process of reducing inflected words to their word stem by removing the word's suffix and prefix. E.g., converting "walking" and "walks" to "walk". Stemming helps to improve the computational time.
        - In this project we did not perform stemming; instead we performed lemmatization, because stemming can produce words that are not proper English.
    - Lemmatization:
        - Similar to stemming, but instead of stripping the word's suffix and prefix it transforms each word to its dictionary root form, which is called the lemma. E.g., converting "went" and "going" to "go". It also helps to improve the computational time.
        - This saves unnecessary computational overhead in trying to decipher entire words, since the meanings of most words are well expressed by their lemmas.
    - Emojis Removal/Transformation:
        - Removing Emojis from text.
        - This task was not conducted in this project.
        - 😀 is an emoji.
        - Note: in the case of a sentiment analysis like this one, it is preferred that emojis are translated to words instead of being removed.
    - Emoticons Removal/Transformation:
        - Removing Emoticons from text.
        - This task was not conducted in this project.
        - According to Grammarist.com, an emoticon is built from keyboard characters that, when put together in a certain way, represent a facial expression, whereas an emoji is an actual image.
        - :-) is an emoticon.
        - Note: in the case of a sentiment analysis, it is preferred that emoticons are translated to words instead of being removed.
    - Chatwords Transformation:
        - Transforming common chatwords such as BRB to "Be Right Back".
        - This task was not conducted in this project.
    - Spelling Checking:
        - Checking the spelling of the data.
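
The emoticon and chat-word transformations listed above were not run in this project. As a rough illustration only, they could be sketched with plain dictionary lookups. The `EMOTICONS` and `CHAT_WORDS` tables and the `expand_tokens` helper are hypothetical names made up for this example; they are not part of the notebook or of any library.

```python
# Hypothetical lookup tables (assumptions for illustration, not taken from the dataset)
EMOTICONS = {":-)": "smiling face", ":-(": "sad face"}
CHAT_WORDS = {"BRB": "Be Right Back", "IMO": "In My Opinion"}

def expand_tokens(text):
    """Replace whitespace-separated emoticons and chat words with plain words."""
    out = []
    for token in text.split():
        if token in EMOTICONS:
            out.append(EMOTICONS[token])
        elif token.upper() in CHAT_WORDS:
            # Chat words are matched case-insensitively
            out.append(CHAT_WORDS[token.upper()])
        else:
            out.append(token)
    return " ".join(out)

print(expand_tokens("brb lunch :-)"))
```

A step like this would have to run before punctuation removal, because stripping punctuation first would delete emoticons such as `:-)` entirely.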



-----------------------

### Import the libraries 

In [8]:
#Import the required libraries
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import re

-----------------------

### Import the data & Perform Basic Analysis

In [36]:
#Import the data 
df = pd.read_csv("Data/bbc_news.csv")
df.head()


Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,"Mon, 07 Mar 2022 00:05:40 GMT",https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,"Mon, 07 Mar 2022 08:15:53 GMT",https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...


In [37]:
#Convert empty or whitespace-only cells to NaN
df = df.replace(r'^\s*$', np.nan, regex=True)

In [38]:
#drop any nan
df = df.dropna() 

In [39]:
#Check the following:
#Number of rows --> 17796 
#Null values in any row --> none
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17796 entries, 0 to 17795
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        17796 non-null  object
 1   pubDate      17796 non-null  object
 2   guid         17796 non-null  object
 3   link         17796 non-null  object
 4   description  17796 non-null  object
dtypes: object(5)
memory usage: 695.3+ KB


In [40]:
#Create a sub dataframe from the main dataframe. This sub dataframe is what the ML algorithm will be executed on
df = df[['title','description']].copy() #.copy() avoids pandas' SettingWithCopyWarning when new columns are added later
df

Unnamed: 0,title,description
0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...
...,...,...
17791,England v Ireland: Stuart Broad says bowlers r...,Stuart Broad says England's attack is ready to...
17792,"England v Ireland: Stuart Broad, Zak Crawley &...",Watch highlights as Stuart Broad's five-wicket...
17793,Phil Neville: Inter Miami sack coach after 10 ...,Major League Soccer side Inter Miami sack coac...
17794,French Open: Jessica Pegula column on getting ...,"American Jessica Pegula, seeded third in the F..."


In [41]:
#View the average length (in characters) of the entries in the description column
from statistics import mean

description_length = [len(x) for x in df['description']]

print ('The average length of each entry of the description column is ', mean(description_length))


The average length of each entry of the description column is  104.74404360530457


-----------------------

### Perform EDA on the data

In [42]:
df.describe()

Unnamed: 0,title,description
count,17796,17796
unique,17033,16718
top,Ukraine war in maps: Tracking the Russian inva...,How closely have you been paying attention to ...
freq,32,52


- There are 17796 news articles, published between March 2022 (the earliest dates shown above) and June 2023.

In [43]:
#Lower casing the description column. We will not run this cell, because this is a Sentiment Analysis
# df['description'] = df['description'].str.lower()
# df.head()

In [44]:
#Removing the URLs & HTML Tags
df['description_text'] = df['description'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True) #URLs
    #https://stackoverflow.com/questions/51994254/removing-url-from-a-column-in-pandas-dataframe
#Chain from description_text (not description) so the URL removal above is not overwritten
df['description_text'] = df['description_text'].str.replace(r'<[^<>]*>', '', regex=True) #HTML Tags

df.head()



Unnamed: 0,title,description,description_text
0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as...","Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...,Consumers are feeling the impact of higher ene...
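
The same clean-up can be tried on a single string with the standard `re` module. This is a minimal sketch: the sample sentence, the `strip_markup` helper, and the deliberately simple e-mail pattern are assumptions made for illustration (the notebook itself removed no e-mails, since the data contained none).

```python
import re

URL_RE = re.compile(r'http\S+|www\.\S+', flags=re.IGNORECASE)
EMAIL_RE = re.compile(r'\S+@\S+\.\S+')  # simplified pattern, not a full address validator
TAG_RE = re.compile(r'<[^<>]*>')

def strip_markup(text):
    """Remove HTML tags, URLs, and e-mail addresses, then collapse extra spaces."""
    # Tags go first so that a URL inside an attribute is removed together with its tag
    text = TAG_RE.sub('', text)
    text = URL_RE.sub('', text)
    text = EMAIL_RE.sub('', text)
    return ' '.join(text.split())

sample = 'Read <b>more</b> at https://www.bbc.co.uk/news or mail desk@example.com today'
print(strip_markup(sample))
```

Escaping the dot in `www\.` keeps the pattern from matching any character after `www`.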


In [45]:
#Removing the Punctuations 
#Replace anything that is neither a word character nor whitespace with ''
df['description_text'] = df['description_text'].str.replace(r'[^\w\s]', '', regex=True)

df.head()

Unnamed: 0,title,description,description_text
0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as...",Jeremy Bowen was on the frontline in Irpin as ...
2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...,One of the worlds biggest fertiliser firms say...
3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...,The parents of the Manchester Arena bombings y...
4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...,Consumers are feeling the impact of higher ene...
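
The choice of symbols matters: the `[^\w\s]` pattern also strips apostrophes, which is why "world's" becomes "worlds" in the output, while digits and underscores are kept because `\w` matches them. The sketch below compares that pattern with a variant that keeps apostrophes. The variant and both helper names are assumptions for illustration and were not used in the project.

```python
import re

def remove_all_punct(text):
    # Same pattern as the notebook cell: keep only word characters and whitespace
    return re.sub(r'[^\w\s]', '', text)

def remove_punct_keep_apostrophes(text):
    # Hypothetical variant: also keep every apostrophe (including any used as quote marks)
    return re.sub(r"[^\w\s']", '', text)

sample = "One of the world's biggest firms says: prices up 5%!!!"
print(remove_all_punct(sample))
print(remove_punct_keep_apostrophes(sample))
```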


In [47]:
#Stop words removal 
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = stopwords.words('english')

print ('The length of stop words is ',len(stop))

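#NLTK's stop words are all lower case and the text was not lower cased, so capitalized forms such as "The" are kept
stop_set = set(stop) #Set membership is faster than scanning the list for every word
df['description_text'] = df['description_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_set]))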
df.head()


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PZ4L6Q\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The length of stop words is  179


Unnamed: 0,title,description,description_text
0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...,The Ukrainian president says country forgive f...
1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as...",Jeremy Bowen frontline Irpin residents came Ru...
2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...,One worlds biggest fertiliser firms says confl...
3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...,The parents Manchester Arena bombings youngest...
4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...,Consumers feeling impact higher energy costs f...
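
The output above still starts with "The" because the stop-word comparison is case-sensitive and the text was never lower cased. The sketch below shows the difference on one sentence. The tiny `STOP` set is a stand-in for NLTK's list, and the `drop_stop_words` helper with its `ignore_case` flag is hypothetical; the project itself used the case-sensitive behaviour.

```python
# Tiny stand-in for NLTK's English list (an assumption; the real list has far more entries)
STOP = {"the", "of", "a", "is", "in", "will"}

def drop_stop_words(text, ignore_case=False):
    """Remove stop words; optionally compare case-insensitively while keeping the original casing."""
    kept = []
    for word in text.split():
        key = word.lower() if ignore_case else word
        if key not in STOP:
            kept.append(word)
    return " ".join(kept)

sample = "The Ukrainian president says the country will not forgive"
print(drop_stop_words(sample))                    # case-sensitive, as in the notebook
print(drop_stop_words(sample, ignore_case=True))  # also drops the capitalized "The"
```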


In [49]:
#Lemmatization 

import nltk
nltk.download('wordnet')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer() #Split the text into tokens on spaces, tabs, and new lines
lemmatizer = nltk.stem.WordNetLemmatizer() #Perform the lemmatization (every token is treated as a noun by default)

def lemmatize_text(text):
    #Lemmatize each token and re-join with single spaces (no trailing space)
    return " ".join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))


df['description_text'] = df.description_text.apply(lemmatize_text)

df.head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PZ4L6Q\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,title,description,description_text
0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...,The Ukrainian president say country forgive fo...
1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as...",Jeremy Bowen frontline Irpin resident came Rus...
2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...,One world biggest fertiliser firm say conflict...
3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...,The parent Manchester Arena bombing youngest v...
4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...,Consumers feeling impact higher energy cost fu...
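
To make the stemming versus lemmatization trade-off concrete, here is a toy comparison that does not need NLTK. Both helpers are assumptions for illustration: `toy_stem` is a crude suffix stripper (not the Porter algorithm), and `toy_lemmatize` is a hand-written lookup standing in for a real dictionary such as WordNet.

```python
SUFFIXES = ("ing", "es", "s")  # toy rules, checked in this order
LEMMAS = {"went": "go", "going": "go", "studies": "study"}  # hypothetical mini dictionary

def toy_stem(word):
    """Chop the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["walking", "walks", "studies", "went"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer turns "studies" into the non-word "studi" and cannot relate "went" to "go", while the lookup handles both. The real `WordNetLemmatizer` treats tokens as nouns unless a part of speech is passed, which is consistent with verb forms such as "came" and "feeling" surviving in the output above.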


In [50]:
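#Checking the spelling of the text

from spellchecker import SpellChecker

spell = SpellChecker(distance=1) #Only consider candidates within an edit distance of 1

def Correct(x):
    #Note: correction() is designed for a single word. Here it receives a whole description,
    #which in practice comes back unchanged (see the output below); the "or x" fallback
    #keeps the original text if correction() returns None.
    return spell.correction(x) or x

df['description_text'] = df['description_text'].apply(Correct)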

df.head()

Unnamed: 0,title,description,description_text
0,Ukraine: Angry Zelensky vows to punish Russian...,The Ukrainian president says the country will ...,The Ukrainian president say country forgive fo...
1,War in Ukraine: Taking cover in a town under a...,"Jeremy Bowen was on the frontline in Irpin, as...",Jeremy Bowen frontline Irpin resident came Rus...
2,Ukraine war 'catastrophic for global food',One of the world's biggest fertiliser firms sa...,One world biggest fertiliser firm say conflict...
3,Manchester Arena bombing: Saffie Roussos's par...,The parents of the Manchester Arena bombing's ...,The parent Manchester Arena bombing youngest v...
4,Ukraine conflict: Oil price soars to highest l...,Consumers are feeling the impact of higher ene...,Consumers feeling impact higher energy cost fu...
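
The rows above are identical to the lemmatized text, which suggests that correcting a whole description at once changes nothing. A word-by-word check within one edit could be sketched as follows, using only the standard library. The `VOCAB` set and the `edits1`, `correct_word`, and `correct_text` helpers are hypothetical; a real spell checker also ranks candidates by word frequency, which this toy version does not.

```python
import string

# Hypothetical mini vocabulary (an assumption for illustration)
VOCAB = {"president", "country", "energy", "cost", "conflict"}

def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct_word(word):
    """Return the word if known, else a known word one edit away, else the word unchanged."""
    if word.lower() in VOCAB:
        return word
    candidates = sorted(edits1(word.lower()) & VOCAB)
    return candidates[0] if candidates else word

def correct_text(text):
    # Correct token by token instead of passing the whole sentence at once
    return " ".join(correct_word(w) for w in text.split())

print(correct_text("presidnt says enrgy cost"))
```

Tokens with no candidate, such as proper nouns missing from the vocabulary, are left as they are, which avoids mangling names like "Irpin".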


--------------------

### Finalize The Data and Export the CSV File

In [51]:
print (df.head())
df.to_csv("Data/bbc_news_post_eda.csv")

                                               title  \
0  Ukraine: Angry Zelensky vows to punish Russian...   
1  War in Ukraine: Taking cover in a town under a...   
2         Ukraine war 'catastrophic for global food'   
3  Manchester Arena bombing: Saffie Roussos's par...   
4  Ukraine conflict: Oil price soars to highest l...   

                                         description  \
0  The Ukrainian president says the country will ...   
1  Jeremy Bowen was on the frontline in Irpin, as...   
2  One of the world's biggest fertiliser firms sa...   
3  The parents of the Manchester Arena bombing's ...   
4  Consumers are feeling the impact of higher ene...   

                                    description_text  
0  The Ukrainian president say country forgive fo...  
1  Jeremy Bowen frontline Irpin resident came Rus...  
2  One world biggest fertiliser firm say conflict...  
3  The parent Manchester Arena bombing youngest v...  
4  Consumers feeling impact higher energy cost fu..