# Introduction

- The dataset is available at: https://www.kaggle.com/datasets/kreeshrajani/human-stress-prediction.
- The motivation behind the dataset is to produce a machine learning model that can predict whether an individual is suffering from psychological stress.
- The data was collected from multiple subreddits.
- The text preprocessing step mainly analyzes the "text" column and prepares it for the ML analysis. This means we will attempt to clean the text as much as possible.
- Common text preprocessing steps include:
    - Lower casing:
        - The idea is to convert all the text into the same casing format. For example, the inputs 'hi', 'Hi', and 'HI' will all be treated the same way.
        - This is helpful when the analysis does not depend on sentiment. However, it can have a slight negative impact on ML models that rely on the original casing of the words, e.g., a sentiment analysis that treats a word written entirely in upper case as a sign of anger.
        - For this project we did not perform lower casing, because we are conducting a sentiment analysis.
    - Removing URLs, Emails & HTML Tags:
        - The idea is to strip URLs, email addresses, and HTML tags from the text.
        - The data did not contain any emails. Therefore, no emails were removed.
    - Punctuation Removal:
        - The idea is to remove punctuation from the text data, so that the same text is treated the same way whether or not it carries punctuation. For example, "Yo" and "Yo!!!" will be treated the same way.
        - This step also removes other symbols: in this project, every character that is not a letter, digit, underscore, or whitespace is dropped.
        - The most important decision in punctuation removal is the selection of symbols to remove.
    - Stop Words Removal:
        - The idea is to remove all the words that commonly occur in the language. E.g., "a", "so", etc.
    - Frequent Words Removal:
        - The idea is to remove all the words that commonly occur in a specific type of document or text.
        - For this project we did not perform frequent words removal, because TF-IDF will take care of this step.
    - Rare Words Removal:
        - The idea is to remove all the words that rarely occur in the text.
        - For this project we did not perform rare words removal, because TF-IDF will take care of this step.
    - Stemming:
        - The process of reducing inflected words to their word stem by removing the word's suffix and prefix. E.g., converting "walking" and "walks" to "walk". Stemming helps to improve the computational time.
        - In this project we did not perform stemming. Instead, we performed lemmatization, because stemming can produce words that are not proper English.
    - Lemmatization:
        - Similar to stemming, but instead of simply stripping the word's suffix and prefix, it transforms each word to its dictionary root form, which is called the lemma. E.g., converting "went" and "going" to "go". It also helps to improve the computational time.
        - This helps save unnecessary computational overhead in trying to decipher entire words, since the meanings of most words are well expressed by their lemmas.
    - Emojis Removal/Transformation:
        - Removing Emojis from text.
        - This task was not conducted in this project.
        - 😀 is an emoji
        - Note: in the case of a sentiment analysis like the one conducted here, it is preferred that emojis are translated to words instead of being removed.
    - Emoticons Removal/Transformation:
        - Removing Emoticons from text.
        - This task was not conducted in this project, because the text data did not include any Emoticons.
        - According to Grammarist.com, an emoticon is built from keyboard characters that, when put together in a certain way, represent a facial expression, whereas an emoji is an actual image.
        - :-) is an emoticon
        - Note: in the case of a sentiment analysis like the one conducted here, it is preferred that emoticons are translated to words instead of being removed.
    - Chatwords Transformation:
        - Transforming common chatwords such as BRB to "Be Right Back".
        - This task was not conducted in this project.
    - Spelling Checking:
        - Checking the spelling of the data.
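
Two of the steps listed above that were not run in this project, chat-word expansion and emoticon translation, can be sketched with plain dictionary lookups. This is only a minimal illustration: the two lookup tables are tiny hypothetical examples rather than a real lexicon, and `expand_tokens` is an assumed helper name that does not appear elsewhere in this notebook.

```python
# Hypothetical lookup tables; a real project would need a much larger lexicon.
CHAT_WORDS = {"BRB": "Be Right Back", "IMO": "In My Opinion"}
EMOTICONS = {":-)": "happy face", ":-(": "sad face"}

def expand_tokens(text):
    """Replace known chat words and emoticons with their word equivalents."""
    out = []
    for token in text.split():
        if token in EMOTICONS:
            out.append(EMOTICONS[token])
        elif token.upper() in CHAT_WORDS:
            # Chat words are matched case-insensitively ("brb" == "BRB").
            out.append(CHAT_WORDS[token.upper()])
        else:
            out.append(token)
    return " ".join(out)

print(expand_tokens("brb I am fine :-)"))
```

A step like this would have to run before punctuation removal, because an emoticon such as `:-)` consists only of punctuation characters and would otherwise be deleted.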



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Preprocessing the preprocessed data :) 
- Importing the libraries
- Preparing the dataset to be worked on
- Data balancing analysis

In [17]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import re

In [18]:
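df = pd.read_csv("Data/Stress.csv")
print(df.head)  # note: without parentheses this prints the bound method (a dump of the whole frame); df.head() would show only the first five rows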

<bound method NDFrame.head of              subreddit post_id sentence_range  \
0                 ptsd  8601tu       (15, 20)   
1           assistance  8lbrx9         (0, 5)   
2                 ptsd  9ch1zh       (15, 20)   
3        relationships  7rorpp        [5, 10]   
4     survivorsofabuse  9p2gbc         [0, 5]   
...                ...     ...            ...   
2833     relationships  7oee1t       [35, 40]   
2834              ptsd  9p4ung       [20, 25]   
2835           anxiety  9nam6l        (5, 10)   
2836    almosthomeless  5y53ya        [5, 10]   
2837              ptsd  5y25cl         [0, 5]   

                                                   text  label  confidence  \
0     He said he had not felt that way before, sugge...      1    0.800000   
1     Hey there r/assistance, Not sure if this is th...      0    1.000000   
2     My mom then hit me with the newspaper and it s...      1    0.800000   
3     until i met my new boyfriend, he is amazing, h...      1    0.6

In [19]:
df = df[['text', 'label']].copy()  # work on an explicit copy of the two columns we need
print(df.head())

                                                text  label
0  He said he had not felt that way before, sugge...      1
1  Hey there r/assistance, Not sure if this is th...      0
2  My mom then hit me with the newspaper and it s...      1
3  until i met my new boyfriend, he is amazing, h...      1
4  October is Domestic Violence Awareness Month a...      1


In [20]:
#Balancing analysis of the label column 

df["label"].value_counts()



1    1488
0    1350
Name: label, dtype: int64

- The data is roughly balanced between 1 (psychological issues, 1488 rows) and 0 (no psychological issues, 1350 rows).
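
The balance can be quantified from the two counts printed above. The short sketch below only redoes that arithmetic in plain Python; the variable names are illustrative assumptions.

```python
# Counts copied from the value_counts() output above.
counts = {1: 1488, 0: 1350}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

Label 1 makes up about 52.4% of the rows and label 0 about 47.6%, a majority-to-minority ratio of roughly 1.10.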

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Converting the data to lower case

In [136]:
# This cell is intentionally skipped: casing is kept because this is a sentiment analysis. Do not run it.
# df['text_preprocessing_lower'] = df['text'].str.lower()
# df.head()
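
The trade-off behind skipping this cell can be seen on a toy example. The sketch below uses made-up tokens, and `uppercase_share` is a hypothetical feature, not something used in this project: it shows one possible way to keep the "shouting" signal if lower casing were applied after all.

```python
from collections import Counter

tokens = ["hi", "Hi", "HI", "STOP", "stop"]

# Case preserved: every casing variant is its own vocabulary entry.
cased_vocab = Counter(tokens)
# Lower cased: the variants collapse into one entry per word.
lowered_vocab = Counter(t.lower() for t in tokens)

def uppercase_share(words):
    """Fraction of words written entirely in capitals (hypothetical feature)."""
    return sum(w.isupper() for w in words) / len(words)

print(len(cased_vocab), len(lowered_vocab))
print(uppercase_share(tokens))
```

Lower casing shrinks the vocabulary from five entries to two here, at the cost of no longer distinguishing "STOP" from "stop".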

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Removing URLs & HTML Tags

In [21]:
df['text_preprocessing'] = df['text'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True) #URLs (raw string, escaped dot, explicit regex=True)
    #https://stackoverflow.com/questions/51994254/removing-url-from-a-column-in-pandas-dataframe
df['text_preprocessing'] = df['text_preprocessing'].str.replace(r'<[^<>]*>', '', regex=True) #HTML Tags

df.head()


| | text | label | text_preprocessing |
|---|---|---|---|
| 0 | He said he had not felt that way before, sugge... | 1 | He said he had not felt that way before, sugge... |
| 1 | Hey there r/assistance, Not sure if this is th... | 0 | Hey there r/assistance, Not sure if this is th... |
| 2 | My mom then hit me with the newspaper and it s... | 1 | My mom then hit me with the newspaper and it s... |
| 3 | until i met my new boyfriend, he is amazing, h... | 1 | until i met my new boyfriend, he is amazing, h... |
| 4 | October is Domestic Violence Awareness Month a... | 1 | October is Domestic Violence Awareness Month a... |
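
The same cleaning can be sketched outside pandas with the standard `re` module, which makes it easy to try the patterns on a single string. This is an illustrative variant, not the cell above: the URL pattern is slightly stricter (`https?://` instead of any token starting with `http`), and the email pattern, the whitespace clean-up, and the name `strip_markup` are assumptions added for the example.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")  # simplified email pattern
TAG_RE = re.compile(r"<[^<>]*>")                         # same HTML pattern as above

def strip_markup(text):
    """Remove URLs, email addresses, and HTML tags, then tidy the whitespace."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    text = TAG_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "See <b>this</b> post https://example.com/a?b=1 or mail me@example.com now"
print(strip_markup(sample))
```

For the sample string this leaves `See this post or mail now`.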


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Removing Punctuations

In [22]:
# Vectorized replacement: avoids the chained assignment df[col][i] = x and the SettingWithCopyWarning it triggers
df['text_preprocessing'] = df['text_preprocessing'].str.replace(r'[^\w\s]', '', regex=True)

df.head()


| | text | label | text_preprocessing |
|---|---|---|---|
| 0 | He said he had not felt that way before, sugge... | 1 | He said he had not felt that way before sugget... |
| 1 | Hey there r/assistance, Not sure if this is th... | 0 | Hey there rassistance Not sure if this is the ... |
| 2 | My mom then hit me with the newspaper and it s... | 1 | My mom then hit me with the newspaper and it s... |
| 3 | until i met my new boyfriend, he is amazing, h... | 1 | until i met my new boyfriend he is amazing he ... |
| 4 | October is Domestic Violence Awareness Month a... | 1 | October is Domestic Violence Awareness Month a... |
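
To see exactly which symbols the pattern `[^\w\s]` removes, it helps to run it on a few short strings. The samples below are made up for illustration, apart from `r/assistance`, which appears in the data.

```python
import re

PUNCT_RE = re.compile(r"[^\w\s]")  # same pattern as in the cell above

samples = ["Yo!!!", "r/assistance", "don't", "snake_case", "Jan 1, 2018", ":-)"]
cleaned = {s: PUNCT_RE.sub("", s) for s in samples}

for original, result in cleaned.items():
    print(repr(original), "->", repr(result))
```

Digits and underscores survive because `\w` matches them, contractions lose their apostrophe ("don't" becomes "dont"), and an emoticon made only of punctuation disappears completely.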


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Stop Words Removal

In [23]:
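import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = set(stopwords.words('english')) # a set makes the membership test below fast
# len(stop)
# Note: the match is case sensitive and the stop list is lower case, so "He", "I", and "My" are kept
df['text_preprocessing'] = df['text_preprocessing'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))
df.head()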


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PZ4L6Q\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | text | label | text_preprocessing |
|---|---|---|---|
| 0 | He said he had not felt that way before, sugge... | 1 | He said felt way suggeted I go rest TRIGGER AH... |
| 1 | Hey there r/assistance, Not sure if this is th... | 0 | Hey rassistance Not sure right place post goes... |
| 2 | My mom then hit me with the newspaper and it s... | 1 | My mom hit newspaper shocked would knows I don... |
| 3 | until i met my new boyfriend, he is amazing, h... | 1 | met new boyfriend amazing kind sweet good stud... |
| 4 | October is Domestic Violence Awareness Month a... | 1 | October Domestic Violence Awareness Month I do... |
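
The output above still contains words such as "He", "I", "My", and "dont". A small sketch with a hypothetical stop list shows why: the lookup is case sensitive, and contractions are stored with their apostrophe, which the punctuation step has already removed. The `ignore_case` switch is an assumption added for the illustration and is not part of the project code.

```python
# A small hypothetical stop list; like NLTK's English list it is all lower case.
STOP = {"he", "had", "not", "that", "i", "my", "don't"}

def remove_stop_words(text, stop=STOP, ignore_case=False):
    """Drop stop words; optionally compare in lower case while keeping the original token."""
    kept = []
    for word in text.split():
        key = word.lower() if ignore_case else word
        if key not in stop:
            kept.append(word)
    return " ".join(kept)

text = "He had not felt that way I dont know"
print(remove_stop_words(text))                    # case-sensitive, as in the cell above
print(remove_stop_words(text, ignore_case=True))  # case-insensitive comparison
```

Comparing in lower case removes "He" and "I" without changing the casing of the words that remain, while "dont" is kept either way because it no longer matches "don't".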


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Lemmatization 

In [24]:
df['text_preprocessing']

0       He said felt way suggeted I go rest TRIGGER AH...
1       Hey rassistance Not sure right place post goes...
2       My mom hit newspaper shocked would knows I don...
3       met new boyfriend amazing kind sweet good stud...
4       October Domestic Violence Awareness Month I do...
                              ...                        
2833    Her week ago Precious I ignored Her Jan 1 Happ...
2834    I dont ability cope anymore Im trying lot thin...
2835    In case first time youre reading post We looki...
2836    Do find normal They good relationship Main pro...
2837    I talking mom morning said sister Her trauma w...
Name: text_preprocessing, Length: 2838, dtype: object

In [25]:
import nltk
nltk.download('wordnet')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer() # splits the text into tokens on spaces, tabs, and new lines
lemmatizer = nltk.stem.WordNetLemmatizer() # performs the lemmatization; lemmatize() treats each word as a noun unless pos is given

def lemmatize_text(text):
    # Lemmatize every token and join the results with single spaces
    return " ".join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))


df['text_preprocessing'] = df['text_preprocessing'].apply(lemmatize_text)

df.head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PZ4L6Q\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


| | text | label | text_preprocessing |
|---|---|---|---|
| 0 | He said he had not felt that way before, sugge... | 1 | He said felt way suggeted I go rest TRIGGER AH... |
| 1 | Hey there r/assistance, Not sure if this is th... | 0 | Hey rassistance Not sure right place post go I... |
| 2 | My mom then hit me with the newspaper and it s... | 1 | My mom hit newspaper shocked would know I dont... |
| 3 | until i met my new boyfriend, he is amazing, h... | 1 | met new boyfriend amazing kind sweet good stud... |
| 4 | October is Domestic Violence Awareness Month a... | 1 | October Domestic Violence Awareness Month I do... |
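
NLTK's `WordNetLemmatizer.lemmatize` accepts a `pos` argument that defaults to noun, so verb forms such as "went" are only reduced to "go" when the lemmatizer is told they are verbs. The toy sketch below imitates that behavior with a hand-written lookup table so it runs without NLTK. The table and the `lemmatize` helper are purely illustrative and are not how WordNet works internally.

```python
# Toy lemma table keyed by (word, part of speech); purely illustrative.
LEMMAS = {
    ("went", "v"): "go",
    ("going", "v"): "go",
    ("posts", "n"): "post",
}

def lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

words = ["went", "going", "posts"]
as_nouns = [lemmatize(w) for w in words]           # default: treat every word as a noun
as_verbs = [lemmatize(w, pos="v") for w in words]  # treat every word as a verb

print(as_nouns)
print(as_verbs)
```

With the noun default only "posts" changes; the verb forms are reduced to "go" only when `pos="v"` is passed. A fuller pipeline would typically tag each token's part of speech first and pass it to the lemmatizer.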


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Checking the spelling of the data

In [26]:

from spellchecker import SpellChecker

spell = SpellChecker(distance=1) #distance of 1 from the original word

def correct(x):
    # Note: correction() is designed for a single word. Here it receives the whole post,
    # and the output below still contains the misspelling "suggeted".
    return spell.correction(x)

# .apply replaces the chained assignment df[col][i] = x and the SettingWithCopyWarning it triggers
df['text_preprocessing'] = df['text_preprocessing'].apply(correct)

df.head()


| | text | label | text_preprocessing |
|---|---|---|---|
| 0 | He said he had not felt that way before, sugge... | 1 | He said felt way suggeted I go rest TRIGGER AH... |
| 1 | Hey there r/assistance, Not sure if this is th... | 0 | Hey rassistance Not sure right place post go I... |
| 2 | My mom then hit me with the newspaper and it s... | 1 | My mom hit newspaper shocked would know I dont... |
| 3 | until i met my new boyfriend, he is amazing, h... | 1 | met new boyfriend amazing kind sweet good stud... |
| 4 | October is Domestic Violence Awareness Month a... | 1 | October Domestic Violence Awareness Month I do... |
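
Because a spell checker corrects one word at a time, a post has to be split into tokens before correction and joined again afterwards. The sketch below shows that structure. To stay runnable with the standard library it swaps in `difflib.get_close_matches` and a tiny hypothetical vocabulary as a stand-in for the spell checker; `closest_word` and `correct_text` are assumed helper names. With pyspellchecker installed, the per-word function could presumably be replaced by `spell.correction`.

```python
import difflib

# Hypothetical tiny vocabulary standing in for a real dictionary.
VOCAB = ["suggested", "said", "felt", "way", "rest"]

def closest_word(word, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if nothing is close enough."""
    if word.lower() in vocab:
        return word
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text, correct_word=closest_word):
    """Correct a text token by token; fall back to the token if no correction is returned."""
    return " ".join(correct_word(tok) or tok for tok in text.split())

print(correct_text("He said felt way suggeted"))
```

The `or tok` fallback keeps the original token whenever the per-word function returns nothing, so no word is lost from the text.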


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Finalize the data

In [27]:
#dropping empty rows (the index reset below is left commented out, so the original index is kept)
print(len(df))

for x in df:
    df[x] = df[x].replace(r'^\s*$', np.nan, regex=True)

df = df.dropna()

# df = df.reset_index(drop=True)

print(len(df))
print(df.head())

2838
2837
                                                text  label  \
0  He said he had not felt that way before, sugge...      1   
1  Hey there r/assistance, Not sure if this is th...      0   
2  My mom then hit me with the newspaper and it s...      1   
3  until i met my new boyfriend, he is amazing, h...      1   
4  October is Domestic Violence Awareness Month a...      1   

                                  text_preprocessing  
0  He said felt way suggeted I go rest TRIGGER AH...  
1  Hey rassistance Not sure right place post go I...  
2  My mom hit newspaper shocked would know I dont...  
3  met new boyfriend amazing kind sweet good stud...  
4  October Domestic Violence Awareness Month I do...  


In [28]:
df['label']

0       1
1       0
2       1
3       1
4       1
       ..
2833    0
2834    1
2835    0
2836    0
2837    1
Name: label, Length: 2837, dtype: int64

In [29]:
df["label"] = df["label"].replace([0, 1],['No psychological issues', 'psychological issues'])


In [30]:
df['label']


0          psychological issues
1       No psychological issues
2          psychological issues
3          psychological issues
4          psychological issues
                 ...           
2833    No psychological issues
2834       psychological issues
2835    No psychological issues
2836    No psychological issues
2837       psychological issues
Name: label, Length: 2837, dtype: object

In [31]:
df = df[['text_preprocessing', 'label']]
df.to_csv("Data/ml_nlp.csv")  # the index is written as an unnamed first column; pass index=False to omit it

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------