<a href="https://colab.research.google.com/github/SterlingHayden/Text-Analytics/blob/main/Data_Loader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [40]:
import pandas as pd

## Data Exploration

In [41]:
df_iphone = pd.read_excel('/content/iphone_reddit_data.xlsx')
df_pixel = pd.read_excel('/content/pixel_reddit_data.xlsx')

In [42]:
df_iphone.head()

Unnamed: 0,text,upvotes,upvote_ratio,text_type,num_comments
0,TEMP,0,0.0,0,0.0
1,Welcome to the weekly stickied WSIB thread.\n\...,0,0.5,Original Post,5.0
2,When will you think the second batch of iPhone...,1,,Comment/Reply,
3,Girlfriend wants the latest iPhone but can't a...,1,,Comment/Reply,
4,Hi! I'm in a dire need of a new iPhone and bas...,1,,Comment/Reply,


In [43]:
df_pixel.head()

Unnamed: 0,text,upvotes,upvote_ratio,text_type,num_comments
0,TEMP,0,0.0,0,0.0
1,This is the weekly photo Megathread. Photos ca...,2,0.75,Original Post,0.0
2,*If you were redirected here from a removed po...,6,0.8,Original Post,88.0
3,Shipping mega thread link is outdated. That's ...,11,,Comment/Reply,
4,What is your overnight P9PXL drain? I lose abo...,4,,Comment/Reply,


In both frames, the first two rows are a `TEMP` placeholder and a stickied thread created by bots/mods, so I'm going to drop them.

In [44]:
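#.copy() makes independent frames, so adding columns later cannot trigger SettingWithCopyWarning
df_iphone = df_iphone.iloc[2:].copy()
df_pixel = df_pixel.iloc[2:].copy()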

## Data Preprocessing  

In [45]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

#build the stopword set and lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
  """Return lowercased, lemmatized tokens with links, stopwords, and punctuation removed."""
  #remove links (http\S+ also covers https)
  text = re.sub(r'http\S+|www\S+', '', text)

  #break the text into individual words/tokens
  tokens = word_tokenize(text)

  #remove stopwords (e.g., "and", "the", "is")
  filtered_tokens = [word for word in tokens if word.lower() not in STOP_WORDS]

  #strip punctuation and special characters, drop tokens left empty, then lowercase
  stripped_tokens = (re.sub(r'\W+', '', word) for word in filtered_tokens)
  filtered_tokens = [word.lower() for word in stripped_tokens if word]

  #lemmatize words (treated as nouns by default); like stemming, but maps to dictionary forms
  clean_tokens = [LEMMATIZER.lemmatize(word) for word in filtered_tokens]

  return clean_tokens


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [46]:
#applying clean_text() to the text data in r/iphone
df_iphone['cleaned_text'] = df_iphone['text'].astype(str).apply(clean_text)

#applying clean_text() to the text data in r/pixel
df_pixel['cleaned_text'] = df_pixel['text'].astype(str).apply(clean_text)
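One thing to watch in the cell above: `astype(str)` is applied to every row, so an empty `text` cell may, depending on the pandas version, be turned into the literal string `nan` and survive as a token. The output shown here does not confirm whether the export contains any empty cells. A minimal sketch of a guard, using a small made-up frame in place of the Reddit data and a hypothetical helper name `text_as_str`:

```python
import numpy as np
import pandas as pd

def text_as_str(series):
    """Return the column as strings, with missing values as empty strings."""
    # fillna('') runs first, so missing cells never reach astype(str)
    return series.fillna('').astype(str)

# Toy data (not from the Reddit export): one real comment, one missing cell
toy_posts = pd.DataFrame({'text': ['Battery life is great', np.nan]})
toy_text = text_as_str(toy_posts['text'])
print(toy_text.tolist())
```

With a guard like this, `clean_text` receives an empty string for a missing row and returns an empty list for it.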

In [47]:
df_iphone['cleaned_text']

Unnamed: 0,cleaned_text
2,"[think, second, batch, iphone, 16, pro, max, a..."
3,"[girlfriend, want, latest, iphone, ca, nt, aff..."
4,"[hi, m, dire, need, new, iphone, basically, qu..."
5,"[product, apple, arrived, defective, l, return..."
6,"[ordered, mine, att, day, launch, still, getti..."
...,...
14000,"[know, annoying, pay, premium, premium, produc..."
14001,"[desert, titanium, pro, max, arrived, scratch,..."
14002,"[know, damage, happened, tiny, thing, wont, sh..."
14003,"[wow, thought, almost, impossible, defect, app..."
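The output above contains fragments such as `ca`, `nt`, and `m`. These appear to come from contractions: `word_tokenize` splits "can't" into `ca` and `n't`, and the punctuation step then strips the apostrophe, leaving pieces that are not in the stopword list. A minimal sketch of one possible post-filter follows. The fragment set and the helper name `drop_contraction_fragments` are assumptions for illustration and are not part of the original notebook.

```python
# Hypothetical set of leftovers produced when contractions are split and
# their apostrophes stripped (e.g. "can't" -> "ca", "nt"; "I'm" -> "m").
CONTRACTION_FRAGMENTS = {"ca", "wo", "nt", "m", "s", "re", "ve", "ll", "d"}

def drop_contraction_fragments(tokens):
    """Remove contraction leftovers from an already-cleaned token list."""
    return [tok for tok in tokens if tok not in CONTRACTION_FRAGMENTS]

# Tokens taken from the visible part of row 3 above
sample = ["girlfriend", "want", "latest", "iphone", "ca", "nt"]
cleaned_sample = drop_contraction_fragments(sample)
print(cleaned_sample)
```

The trade-off is that any legitimate token that happens to match a fragment is removed too, so the set would need checking against the corpus before use.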


In [48]:
df_pixel['cleaned_text']

Unnamed: 0,cleaned_text
2,"[redirected, removed, post, megathread, link, ..."
3,"[shipping, mega, thread, link, outdated, s, la..."
4,"[overnight, p9pxl, drain, lose, 15, overnight,..."
5,"[p9pxl, averaging, 6hrs, sot, last, day, m, en..."
6,"[live, australia, jbhifi, promos, around, new,..."
...,...
12531,[deleted]
12532,"[re, hiding, mentioned, new, lens, new, sensor..."
12533,"[suggestion, camera, hardware, improved, resul..."
12534,[deleted]
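Rows 12531 and 12534 above contain only the token `deleted`, which suggests the original comment text was Reddit's `[deleted]` placeholder. A minimal sketch of filtering such rows out, assuming placeholder-only rows carry no useful signal for later analysis. The `PLACEHOLDERS` set, the helper name, and the toy frame are illustrative assumptions.

```python
import pandas as pd

# Assumed placeholder tokens; only "deleted" is visible in the output above
PLACEHOLDERS = {"deleted", "removed"}

def drop_placeholder_rows(df, column="cleaned_text"):
    """Keep rows whose token list is non-empty and not made up only of placeholders."""
    def is_informative(tokens):
        return len(tokens) > 0 and not set(tokens) <= PLACEHOLDERS
    return df[df[column].apply(is_informative)].copy()

# Toy frame standing in for df_pixel
toy_tokens = pd.DataFrame(
    {"cleaned_text": [["deleted"], ["suggestion", "camera", "hardware"], []]}
)
kept = drop_placeholder_rows(toy_tokens)
print(kept["cleaned_text"].tolist())
```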


In [49]:
df_pixel.to_csv('pixel_processed.csv')
df_iphone.to_csv('iphone_processed.csv')
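
CSV has no list type, so the `cleaned_text` lists are written as their string representation (for example `"['think', 'second']"`) and come back as plain strings when the file is reloaded. A minimal sketch of restoring them with `ast.literal_eval`, using an in-memory buffer and toy data instead of the real files:

```python
import ast
import io

import pandas as pd

# Toy frame standing in for the processed data
toy_processed = pd.DataFrame(
    {"cleaned_text": [["think", "second", "batch"], ["deleted"]]}
)

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
toy_processed.to_csv(buffer, index=False)
buffer.seek(0)

# converters parses each cell back into a Python list while reading
restored = pd.read_csv(buffer, converters={"cleaned_text": ast.literal_eval})
print(restored["cleaned_text"].tolist())
```

If downstream code needs the lists as-is, a format that stores Python objects directly (for example `DataFrame.to_pickle`) avoids the parsing step.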