In [1]:
import re

### Importing unescape to decode HTML entities (e.g. &amp;)
from html import unescape

### Importing pandas to organise the data for analysis
import pandas as pd

### Importing spaCy for preprocessing the data
import en_core_web_sm
nlp = en_core_web_sm.load()

In [2]:
### Loads the CSV file containing the Reddit data
file = 'data/scrapedRedditData5000.csv'
df = pd.read_csv(file, encoding='utf-8', index_col=0)

# Sanity check that the file opens; this prints the file handle, not its contents
with open(file, encoding='utf-8') as f:
    print(f)


<_io.TextIOWrapper name='data/scrapedRedditData5000.csv' mode='r' encoding='utf-8'>


In [7]:
# langdetect is used to filter out non-English entries
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

### Function that keeps English text only, strips noise, then tokenises (segments) and lemmatises (reduces words to their dictionary form) the text
def clean(text):
    # non-string values (e.g. NaN) cannot be processed
    if not isinstance(text, str):
        return ''

    # decode HTML entities
    text = unescape(text)

    # detect() raises LangDetectException when the text has no usable features (e.g. an empty string)
    try:
        english = detect(text) == 'en'
    except LangDetectException:
        english = False
    if not english:
        return ''  # returns an empty string if the text is not in English

    # lowercasing is currently disabled
    # text = text.lower()

    # removes URLs in text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # removes mentions (@)
    text = re.sub(r'@\w+', '', text)

    # removes numbers
    text = re.sub(r'\d+', '', text)

    # repairs the mis-encoded apostrophe 'â€™' before non-ASCII characters are stripped
    text = text.replace('â€™', "'")
    # removes the remaining non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # debug output: the text at this stage of cleaning
    print(text)

    # removes the escaped list marker '\-' from the data
    text = re.sub(r'\\-', '', text)

    # removes hashtags
    text = re.sub(r'#[A-Za-z0-9]+', '', text)
    doc = nlp(text)

    # lemmatises the tokens, dropping stop words and punctuation
    cleaned_text = " ".join(token.lemma_ for token in doc if not token.is_stop and not token.is_punct)

    return cleaned_text
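
The regex part of `clean` can be tried on its own, without langdetect or spaCy. The sketch below is only an illustration: the helper name `strip_noise` and the sample string are made up, and the final whitespace collapse is an addition that the function above does not perform.

```python
import re
from html import unescape


def strip_noise(text: str) -> str:
    """Hypothetical helper: only the regex steps of clean(), in the same order."""
    text = unescape(text)                               # HTML entities, e.g. "&amp;" -> "&"
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # URLs
    text = re.sub(r"@\w+", "", text)                    # @mentions
    text = re.sub(r"\d+", "", text)                     # digits
    text = re.sub(r"[^\x00-\x7F]+", "", text)           # non-ASCII characters
    text = re.sub(r"\\-", "", text)                     # escaped list marker "\-"
    text = re.sub(r"#[A-Za-z0-9]+", "", text)           # hashtags
    # Assumption for readability: collapse the gaps left by the removals
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample, not taken from the dataset
sample = r"Is the iPhone 15 worth it? \-Overheating &amp; cracks, see https://example.com @apple #iphone15"
print(strip_noise(sample))
# prints: Is the iPhone worth it? Overheating & cracks, see
```

Because digits are stripped, model numbers such as "15" disappear from the text, which is why the printed titles below read "the iPhone ?".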

In [8]:
### Casts the text column to str and removes rows with no text
df['text'] = df['text'].astype(str)
print(df['text'].head())

# astype(str) turns missing values into the string 'nan'; convert those back to NA and drop the rows.
# The result is assigned back to the column because df['text'].replace(..., inplace=True) acts on a copy.
df['text'] = df['text'].replace('nan', pd.NA)
df.dropna(subset=['text'], inplace=True)
print(df['text'].head())

0    I've seen a ton of negative reviews:  \n\-Easi...
2    I thought iPhone 13 Pro and 14 Pro series alre...
3    Also, how's the camera and battery life?\n\nfe...
4    After three long years and handful of android ...
6    Anyone here with iphone 15 pro facing absolute...
Name: text, dtype: object
0    I've seen a ton of negative reviews:  \n\-Easi...
2    I thought iPhone 13 Pro and 14 Pro series alre...
3    Also, how's the camera and battery life?\n\nfe...
4    After three long years and handful of android ...
6    Anyone here with iphone 15 pro facing absolute...
Name: text, dtype: object
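
A minimal sketch of the assign-back pattern for the replace step in the cell above, on a made-up three-row frame (the `demo` name and its rows are illustrative, not taken from the dataset). Calling `replace(..., inplace=True)` on `df['text']` acts on an intermediate object, so the result is assigned back to the column instead.

```python
import pandas as pd

# Hypothetical toy data: the middle row is a literal 'nan' string standing in for a missing value
demo = pd.DataFrame({"text": ["great phone", "nan", "battery drains"]})

# Assign the result back to the column rather than using inplace=True on demo["text"]
demo["text"] = demo["text"].replace("nan", pd.NA)

# Drop the rows whose text is now missing
demo = demo.dropna(subset=["text"])

print(demo["text"].tolist())
# prints: ['great phone', 'battery drains']
```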


In [9]:
### Text Pre-Processing
df['cleaned_title'] = df['title'].apply(clean)

Is it worth getting the iPhone ?
Why is the demand for the iPhone  series so high this year?
Has anyone got the base model of the iPhone /plus, what are your thoughts?
 Plus thoughts - back to iPhone after  years
Anyone bought iphone  pro
What do yall think is the Best iphone  pro max color?
I found an iPhone  max
Temperature of my iPhone  Pro Max while on the phone for  mins.
: iPhone  vs iPhone  Pro
iPhone  Pro Max Burn In
Please dont buy an iPhone  today
iPhone  Pro Max arrived cracked
Apple FineWoven iPhone /iPhone  Pro Case Review - $ For This?
Shot on my iPhone  Pro Max
iPhone  Alarms Failing To Go Off - Is This Happening To Anyone Else?
After making me wait two weeks, here's the replacement cable apple sent me. (I have an iPhone ) 
UPGRADED! iPhone  to iPhone  Pro Max 
After waiting two weeks, apple sends me this replacement cable for my iPhone . 
PSA: iPhone  Pro/Pro Max Titanium Scratches
I had an iPhone  Pro for  hours!
iPhone  Pro Camera is.. wtf?
iPhone  Pro Max
The first P

In [10]:
df['cleaned_text'] = df['text'].apply(clean)

I've seen a ton of negative reviews:  
\-Easily cracking (the most trending example at the moment)  
\-Battery easily draining  
\-Overheating  
\-Buggy software  


Is it even worth buying???
I thought iPhone  Pro and  Pro series already had pretty high demand given they were hard to get for months. But the  Pros got backordered within ~ minutes, and I havent seen Apple Stores everywhere this busy in a long time. 

The  Pros have nice improvements, but nothing particularly groundbreaking or flashy compared to the  Pros. The regular s got a really good upgrade this year and nice colors, but those arent the ones that seem to be selling out as fast. 

So, what do you think is making so many people upgrade this year?
Also, how's the camera and battery life?

feel free to share pics in the comment of the color you got (I'm having a hard time finding some that are not about the  Pro/Pro Max lol)
After three long years and handful of android phones, I am finally back to using an iPhone. Brie

In [12]:
### Saving the pre-processed data as a new .CSV file for sentiment analysis
df.to_csv('data/englishconverted_cleanedRedditData5000.csv')
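
One thing that may be worth checking when this file is loaded again for sentiment analysis: `clean` returns an empty string for non-English rows, and pandas reads empty CSV fields back as missing values by default. The sketch below uses an in-memory buffer rather than the real file, and the `demo` frame is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical toy data: the second row mimics a non-English post cleaned to ''
demo = pd.DataFrame({"cleaned_text": ["worth buy iphone", ""]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# With default settings the empty field comes back as a missing value
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded["cleaned_text"].isna().tolist())
# prints: [False, True]

# One option: turn the missing values back into empty strings
restored = reloaded["cleaned_text"].fillna("")
print(restored.tolist())
# prints: ['worth buy iphone', '']
```

Dropping those rows before the analysis, instead of restoring them, is an equally reasonable choice; which one fits depends on whether row counts need to match the original data.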