Data Import

<div class="alert alert-block alert-danger">



Date: 21/04/2024

Environment: Used Google Colab for Python

Libraries used:
* os (for interacting with the operating system, included in the Python standard library)
* pandas 1.1.0 (for dataframes, installed and imported)
* multiprocessing (for running processes on multiple cores, included in Python 3.6.9)
* itertools (for performing operations on iterables)
* nltk 3.5 (Natural Language Toolkit, installed and imported)
* nltk.tokenize (for tokenization, installed and imported)
* nltk.stem (for stemming the tokens, installed and imported)

    </div>

<div class="alert alert-block alert-info">
    
## Table of Contents

</div>

[1. Introduction](#Intro) <br>
[2. Importing Libraries](#libs) <br>
[3. Examining Input File](#examine) <br>
[4. Loading and Parsing Files](#load) <br>
$\;\;\;\;$[4.1. Tokenization](#tokenize) <br>
$\;\;\;\;$[4.3. Generate numerical representation](#whetev1) <br>
[5. Writing Output Files](#write) <br>
$\;\;\;\;$[5.1. Vocabulary List](#write-vocab) <br>
$\;\;\;\;$[5.2. Sparse Matrix](#write-sparseMat) <br>
[6. Summary](#summary) <br>
[7. References](#Ref) <br>

<div class="alert alert-block alert-success">
    
## 1.  Introduction  <a class="anchor" name="Intro"></a>
This assessment concerns textual data; the aim is to extract the data, process it, and transform it into a proper format. The dataset provided is an Excel workbook (.xlsx file) containing comment data.

<div class="alert alert-block alert-success">
    
## 2.  Importing Libraries  <a class="anchor" name="libs"></a>


In [318]:

import re
import pandas as pd
import numpy as np
from langdetect import DetectorFactory, detect
import nltk
from multiprocessing import Pool
from google.colab import files
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk import bigrams, word_tokenize
from nltk.collocations import BigramAssocMeasures ,BigramCollocationFinder
from collections import defaultdict

# download the Punkt tokenizer data used by nltk's word_tokenize
nltk.download('punkt')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

<div class="alert alert-block alert-success">
    
## 3.  Examining Input File <a class="anchor" name="examine"></a>


In [319]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


<div class="alert alert-block alert-success">
    
## 4.  Loading and Parsing File <a class="anchor" name="load"></a>


In [3]:
#load data
df = pd.read_excel('/content/drive/Shared drives/FIT5196_S1_2024/A1/Students data/Task 2/Group124.xlsx')

df.shape

(3253, 4)

Having loaded the Excel file, the following observations can be made:

*   There are 3253 rows
*   There are 4 columns
*   There are some missing values (NaN)
*   All columns are unnamed






In [4]:
#first look at data

df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,,,,
2,,,id,snippet
3,,,Ugx717ErV1Jozg-Id0x4AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."
4,,,UgznRRv2e3Um20zo4Wl4AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."


In [5]:
#concat will combine all the sheets into one dataframe
df = pd.concat(pd.read_excel('/content/drive/Shared drives/FIT5196_S1_2024/A1/Students data/Task 2/Group124.xlsx', sheet_name= None))

In [6]:
# earlier the shape was (3253, 4); now it is (98490, 8), meaning all the sheets are now in one dataframe
df.shape

(98490, 8)

In [7]:
#getting rid of duplicates
df = df.drop_duplicates()

In [8]:
#checking shape after dropping duplicates
df.shape
#there were 4,162 duplicates found


(94328, 8)

Text Extraction & Cleaning

*   Here I extracted everything after the 'textOriginal' key and saved it into a new dataframe for later use.
*   I also removed the columns that were not needed for now.






In [9]:

df = df['Unnamed: 3'].str.extract('textOriginal(.*)')

df.head()

Unnamed: 0,Unnamed: 1,0
Sheet0,0,
Sheet0,2,
Sheet0,3,': 'I listened to this sitting on my cold dark...
Sheet0,4,"': 'Great!', 'authorDisplayName': '@ronaldslie..."
Sheet0,5,': 'I saw online at WHO that during the ponzi ...


In [10]:

column_names = df.columns
print(column_names)

df = df.rename(columns={0: 'Text'})

df.head()

Index([0], dtype='int64')


Unnamed: 0,Unnamed: 1,Text
Sheet0,0,
Sheet0,2,
Sheet0,3,': 'I listened to this sitting on my cold dark...
Sheet0,4,"': 'Great!', 'authorDisplayName': '@ronaldslie..."
Sheet0,5,': 'I saw online at WHO that during the ponzi ...


In [11]:

df['Lowercase_Text'] = df['Text'].str.lower()


In [12]:
df.head(20)

Unnamed: 0,Unnamed: 1,Text,Lowercase_Text
Sheet0,0,,
Sheet0,2,,
Sheet0,3,': 'I listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...
Sheet0,4,"': 'Great!', 'authorDisplayName': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie..."
Sheet0,5,': 'I saw online at WHO that during the ponzi ...,': 'i saw online at who that during the ponzi ...
Sheet0,6,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy...."
Sheet0,7,"': 'IM BLASTING THIS WHEN IM AT SCHOOL', 'auth...","': 'im blasting this when im at school', 'auth..."
Sheet0,8,"': 'This is crazy awesome', 'authorDisplayName...","': 'this is crazy awesome', 'authordisplayname..."
Sheet0,9,"': 'love this video', 'authorDisplayName': '@u...","': 'love this video', 'authordisplayname': '@u..."
Sheet0,10,"': ""Glad I never watched any of these movies c...","': ""glad i never watched any of these movies c..."




*   I had to decode the text as UTF-8 before cleaning the comments. This step took a while because I initially left out the (if isinstance(x, bytes) else x) check.




In [13]:

df['Lowercase_Text'] = df['Lowercase_Text'].apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)

df['Lowercase_Text_No_Emoji'] = df['Lowercase_Text']

In [14]:
df.head(20)

Unnamed: 0,Unnamed: 1,Text,Lowercase_Text,Lowercase_Text_No_Emoji
Sheet0,0,,,
Sheet0,2,,,
Sheet0,3,': 'I listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...
Sheet0,4,"': 'Great!', 'authorDisplayName': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie..."
Sheet0,5,': 'I saw online at WHO that during the ponzi ...,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...
Sheet0,6,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy...."
Sheet0,7,"': 'IM BLASTING THIS WHEN IM AT SCHOOL', 'auth...","': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth..."
Sheet0,8,"': 'This is crazy awesome', 'authorDisplayName...","': 'this is crazy awesome', 'authordisplayname...","': 'this is crazy awesome', 'authordisplayname..."
Sheet0,9,"': 'love this video', 'authorDisplayName': '@u...","': 'love this video', 'authordisplayname': '@u...","': 'love this video', 'authordisplayname': '@u..."
Sheet0,10,"': ""Glad I never watched any of these movies c...","': ""glad i never watched any of these movies c...","': ""glad i never watched any of these movies c..."


In [15]:
# open the emoji list in read mode
with open("/content/drive/Shared drives/FIT5196_S1_2024/A1/emoji.txt", "r", encoding="utf-8") as file:

    # each line is an emoji
    emojis = [line.strip() for line in file]

    # loop over the emojis and remove each one from every row
    # (regex=False so that characters such as '*' or '#' are matched literally)
    for emoji in emojis:
       df['Lowercase_Text_No_Emoji'] =  df['Lowercase_Text_No_Emoji'].str.replace(emoji, '', regex=False)
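
The loop above scans the whole column once per emoji. A possible alternative, sketched here under the assumption that the emoji file holds one emoji per line, is to combine all emojis into a single escaped regular expression so the text is scanned only once. The helper names `build_emoji_pattern` and `strip_emojis` and the two sample emojis are illustrative only.

```python
import re

def build_emoji_pattern(emojis):
    """Compile one regex that matches any emoji in the list literally."""
    # Longest first, so multi-character emojis win over their prefixes;
    # empty strings (blank lines in the file) are skipped.
    unique = sorted({e for e in emojis if e}, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in unique))

def strip_emojis(text, pattern):
    """Remove every emoji match; non-string values are returned unchanged."""
    return pattern.sub("", text) if isinstance(text, str) else text

# "\u2764" is a heart and "\U0001F600" a grinning face.
emoji_pattern = build_emoji_pattern(["\u2764", "\U0001F600", ""])
cleaned = strip_emojis("thank you for this nice video\u2764", emoji_pattern)
print(cleaned)
```

pandas' `Series.str.replace` also accepts a compiled pattern, so the same pattern could be passed there with `regex=True` instead of applying the helper row by row.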


In [16]:
#checking to see if emoji is removed
df.head(20)


Unnamed: 0,Unnamed: 1,Text,Lowercase_Text,Lowercase_Text_No_Emoji
Sheet0,0,,,
Sheet0,2,,,
Sheet0,3,': 'I listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...
Sheet0,4,"': 'Great!', 'authorDisplayName': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie..."
Sheet0,5,': 'I saw online at WHO that during the ponzi ...,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...
Sheet0,6,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy...."
Sheet0,7,"': 'IM BLASTING THIS WHEN IM AT SCHOOL', 'auth...","': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth..."
Sheet0,8,"': 'This is crazy awesome', 'authorDisplayName...","': 'this is crazy awesome', 'authordisplayname...","': 'this is crazy awesome', 'authordisplayname..."
Sheet0,9,"': 'love this video', 'authorDisplayName': '@u...","': 'love this video', 'authordisplayname': '@u...","': 'love this video', 'authordisplayname': '@u..."
Sheet0,10,"': ""Glad I never watched any of these movies c...","': ""glad i never watched any of these movies c...","': ""glad i never watched any of these movies c..."


In [17]:
df.drop(columns=['Text', 'Lowercase_Text'], inplace=True)
df.head(20)


Unnamed: 0,Unnamed: 1,Lowercase_Text_No_Emoji
Sheet0,0,
Sheet0,2,
Sheet0,3,': 'i listened to this sitting on my cold dark...
Sheet0,4,"': 'great!', 'authordisplayname': '@ronaldslie..."
Sheet0,5,': 'i saw online at who that during the ponzi ...
Sheet0,6,"': ""his voice sound like it's made out of soy...."
Sheet0,7,"': 'im blasting this when im at school', 'auth..."
Sheet0,8,"': 'this is crazy awesome', 'authordisplayname..."
Sheet0,9,"': 'love this video', 'authordisplayname': '@u..."
Sheet0,10,"': ""glad i never watched any of these movies c..."


This code cell takes around 6 minutes to finish running, which is a long time. I could have optimized it by not using langdetect, but that was one of the assignment requirements.

In [18]:
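def detect_lang(comment):
    # return the comment if langdetect classifies it as English, otherwise None
    try:
        lang = detect(comment)
        if lang == 'en':
            return comment
    except Exception:
        # langdetect raises an exception when the text has no detectable features
        pass
    return None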

df['English_Text'] = df['Lowercase_Text_No_Emoji'].apply(lambda x: detect_lang(x) if isinstance(x, str) else None)


english_df = df.dropna(subset=['English_Text'])
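
One way the running time might be reduced while still using langdetect is to detect each distinct comment only once, since short comments such as "great!" can occur many times. The sketch below shows the idea; `detect_fn` stands in for `langdetect.detect`, and `fake_detect` is a made-up substitute so the example runs without the library. How much time this saves depends on how many comments repeat, which has not been measured here.

```python
def keep_english(comments, detect_fn):
    """Return each comment if detect_fn labels it 'en', otherwise None."""
    is_english = {}  # comment text -> bool, filled the first time a text is seen
    result = []
    for comment in comments:
        if not isinstance(comment, str):
            result.append(None)  # NaN or other non-string values
            continue
        if comment not in is_english:
            try:
                is_english[comment] = detect_fn(comment) == "en"
            except Exception:
                # treat undetectable text as non-English
                is_english[comment] = False
        result.append(comment if is_english[comment] else None)
    return result

calls = []

def fake_detect(text):
    """Stand-in detector: ASCII-only text counts as English."""
    calls.append(text)
    return "en" if text.isascii() else "ru"

filtered = keep_english(["great!", "great!", "сперва", None], fake_detect)
print(filtered, len(calls))
```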


In [320]:
df.shape

(36788, 6)

In [20]:
# there are fewer rows, meaning the language filter did work
english_df.shape

(36788, 2)



When I first did the extraction it was not in a function, but then I realised I had to apply it to both dataframes, so I created a function that I could easily apply to both instead of repeating the code.




In [21]:
def extract_channel(DF):

  DF.loc[:,'channel'] = DF['Lowercase_Text_No_Emoji'].str.extract(r"'value':\s+'(.*?)'", expand=False)

  return DF

I was having issues with the channel ID pattern and kept getting a NaN value, so I printed an entire row to see what it looked like. I now need to include the 'value' key in the pattern.

In [22]:

row = english_df.iloc[2].to_dict()
print(row)

{'Lowercase_Text_No_Emoji': '\': \'i saw online at who that during the ponzi scheme called covid there were 8 million people dying each year from cigarette smoking. i thought super strange because not that many people were "dying" from covid at that time, so what was the deal about making this covid the biggest news maker? 8 million people should be the biggest news maker. so.....must be the cigarette makers that are paying off the who to be quiet.\\n not that there is such a thing as covid. dr. david martin found over 4000 illegal patents with the name of covid. he has been watching for many years the relationship between patents and weapons. he says that covid is made up to cause billions of people to take those weapons/vaccines. he says this can be only one thing and that is genocide. he knows the names of all the people that put this genocide into motion. he has put their names online. he is suing as many as possible including medicaid and medicare.\\nnot that there is such a thing

In [23]:
extract_channel(english_df)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DF.loc[:,'channel'] = DF['Lowercase_Text_No_Emoji'].str.extract(r"'value':\s+'(.*?)'", expand=False)


Unnamed: 0,Unnamed: 1,Lowercase_Text_No_Emoji,English_Text,channel
Sheet0,3,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,ucyyvw-yw9y_x34seiv1gqgw
Sheet0,4,"': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...",ucxmead5w1ptnx2ftncnif9q
Sheet0,5,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...,ucrbtci2rh4_idtcb_eafzgq
Sheet0,6,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....",uczhzeb6gw2mtsvcwu-6fwyq
Sheet0,7,"': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth...",ucrfn57im3avpoo3hvt69z9w
...,...,...,...,...
Sheet28,3101,"': 'сперва прикольно, а потом дура с котами вс...","': 'сперва прикольно, а потом дура с котами вс...",ucqezb1vfkiifwwygrdlonla
Sheet28,3102,"': 'what did that cat say???', 'authordisplayn...","': 'what did that cat say???', 'authordisplayn...",ucmgtvcfy2bpumi3xhvidsyg
Sheet28,3103,"': 'why are so many cats saying the n word?!',...","': 'why are so many cats saying the n word?!',...",uc2bmrqfs4lbqm3mtojmxlwa
Sheet28,3104,"': 'this is confusing', 'authordisplayname': '...","': 'this is confusing', 'authordisplayname': '...",uc8ygpj9ghc3xuyxiccb7tfg
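
The `SettingWithCopyWarning` shown above appears because `english_df` was produced by filtering `df`, so pandas cannot tell whether the new column is meant for the original or for the filtered result. A common way to avoid it is to take an explicit copy when the filtered dataframe is created. The sketch below uses a made-up two-row dataframe; the `'authorchannelid'` key in the sample strings is an assumption about the data.

```python
import pandas as pd

df = pd.DataFrame({
    "Lowercase_Text_No_Emoji": [
        "': 'great!', 'authorchannelid': {'value': 'uc123'}",
        "': 'hola', 'authorchannelid': {'value': 'uc456'}",
    ],
    "English_Text": ["': 'great!'", None],
})

# .copy() makes english_df an independent dataframe, so adding a
# column to it is unambiguous and does not touch df.
english_df = df.dropna(subset=["English_Text"]).copy()
english_df["channel"] = english_df["Lowercase_Text_No_Emoji"].str.extract(
    r"'value':\s+'(.*?)'", expand=False
)
print(english_df)
```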


In [72]:
extract_channel(df)

Unnamed: 0,Lowercase_Text_No_Emoji,English_Text,channel,channel_ID,Count
0,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,ucyyvw-yw9y_x34seiv1gqgw,2,1
1,"': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...",ucxmead5w1ptnx2ftncnif9q,3,1
2,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...,ucrbtci2rh4_idtcb_eafzgq,4,1
3,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....",uczhzeb6gw2mtsvcwu-6fwyq,5,1
4,"': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth...",ucrfn57im3avpoo3hvt69z9w,6,1
...,...,...,...,...,...
36783,"': 'сперва прикольно, а потом дура с котами вс...","': 'сперва прикольно, а потом дура с котами вс...",ucqezb1vfkiifwwygrdlonla,36499,1
36784,"': 'what did that cat say???', 'authordisplayn...","': 'what did that cat say???', 'authordisplayn...",ucmgtvcfy2bpumi3xhvidsyg,36500,1
36785,"': 'why are so many cats saying the n word?!',...","': 'why are so many cats saying the n word?!',...",uc2bmrqfs4lbqm3mtojmxlwa,36501,1
36786,"': 'this is confusing', 'authordisplayname': '...","': 'this is confusing', 'authordisplayname': '...",uc8ygpj9ghc3xuyxiccb7tfg,36502,1


In [25]:
english_df.head(20)

Unnamed: 0,Unnamed: 1,Lowercase_Text_No_Emoji,English_Text,channel
Sheet0,3,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,ucyyvw-yw9y_x34seiv1gqgw
Sheet0,4,"': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...",ucxmead5w1ptnx2ftncnif9q
Sheet0,5,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...,ucrbtci2rh4_idtcb_eafzgq
Sheet0,6,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....",uczhzeb6gw2mtsvcwu-6fwyq
Sheet0,7,"': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth...",ucrfn57im3avpoo3hvt69z9w
Sheet0,8,"': 'this is crazy awesome', 'authordisplayname...","': 'this is crazy awesome', 'authordisplayname...",ucsjfstsyg4uly5m6oitmajg
Sheet0,9,"': 'love this video', 'authordisplayname': '@u...","': 'love this video', 'authordisplayname': '@u...",ucunrup5jjcfkl2yvxt7xzsq
Sheet0,10,"': ""glad i never watched any of these movies c...","': ""glad i never watched any of these movies c...",ucucdduck5ocaiys7uvaoc8w
Sheet0,11,"': 'thank you for this nice video❤', 'authordi...","': 'thank you for this nice video❤', 'authordi...",uc-tedxv4nyguzmuuyyzptdq
Sheet0,12,': 'ugh. paypal sold out all their paypal cred...,': 'ugh. paypal sold out all their paypal cred...,uctq7-bnium8m4ixbops9ujq


Here I had to map each unique channel to a channel ID.

The channel value was just a random-looking code. It identified the channel that each row belonged to, but it had no meaningful value on its own,

so I converted it to a number instead. This way it was easier to work with.

In [26]:
def mapping_channelID(DF, chan_column):
  # give each unique channel a number so it is easier to group by later on
  unique_ID = DF[chan_column].unique()

  # map every channel to its number (starting from 1)
  mapping = {channel: i+1 for i, channel in enumerate(unique_ID)}

  DF.loc[:,'channel_ID'] = DF[chan_column].map(mapping)

  return DF
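
For comparison, pandas has a built-in that performs the same kind of numbering: `pd.factorize` assigns integer codes in order of first appearance. The sketch below uses a made-up series of channel strings and adds 1 so the IDs start from 1, as in the function above.

```python
import pandas as pd

# Hypothetical channel codes, for illustration only.
channels = pd.Series(["ucb", "uca", "ucb", "ucc"])

# codes are 0-based integers in order of first appearance;
# uniques holds the distinct channel values in that same order.
codes, uniques = pd.factorize(channels)
channel_ids = codes + 1
print(list(channel_ids), list(uniques))
```

One difference to be aware of: `factorize` gives missing values the code -1, so they would end up with ID 0 after adding 1.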


In [27]:
english_df.head(50)

Unnamed: 0,Unnamed: 1,Lowercase_Text_No_Emoji,English_Text,channel
Sheet0,3,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,ucyyvw-yw9y_x34seiv1gqgw
Sheet0,4,"': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...",ucxmead5w1ptnx2ftncnif9q
Sheet0,5,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...,ucrbtci2rh4_idtcb_eafzgq
Sheet0,6,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....",uczhzeb6gw2mtsvcwu-6fwyq
Sheet0,7,"': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth...",ucrfn57im3avpoo3hvt69z9w
Sheet0,8,"': 'this is crazy awesome', 'authordisplayname...","': 'this is crazy awesome', 'authordisplayname...",ucsjfstsyg4uly5m6oitmajg
Sheet0,9,"': 'love this video', 'authordisplayname': '@u...","': 'love this video', 'authordisplayname': '@u...",ucunrup5jjcfkl2yvxt7xzsq
Sheet0,10,"': ""glad i never watched any of these movies c...","': ""glad i never watched any of these movies c...",ucucdduck5ocaiys7uvaoc8w
Sheet0,11,"': 'thank you for this nice video❤', 'authordi...","': 'thank you for this nice video❤', 'authordi...",uc-tedxv4nyguzmuuyyzptdq
Sheet0,12,': 'ugh. paypal sold out all their paypal cred...,': 'ugh. paypal sold out all their paypal cred...,uctq7-bnium8m4ixbops9ujq


In [28]:
#english_df.drop(columns=['en_by_channel', 'channel_Id'], inplace=True)

In [74]:
df.head(20)

Unnamed: 0,Lowercase_Text_No_Emoji,English_Text,channel,channel_ID,Count
0,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,ucyyvw-yw9y_x34seiv1gqgw,2,1
1,"': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...",ucxmead5w1ptnx2ftncnif9q,3,1
2,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...,ucrbtci2rh4_idtcb_eafzgq,4,1
3,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....",uczhzeb6gw2mtsvcwu-6fwyq,5,1
4,"': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth...",ucrfn57im3avpoo3hvt69z9w,6,1
5,"': 'this is crazy awesome', 'authordisplayname...","': 'this is crazy awesome', 'authordisplayname...",ucsjfstsyg4uly5m6oitmajg,7,1
6,"': 'love this video', 'authordisplayname': '@u...","': 'love this video', 'authordisplayname': '@u...",ucunrup5jjcfkl2yvxt7xzsq,8,2
7,"': 'excellent editing', 'authordisplayname': '...","': 'excellent editing', 'authordisplayname': '...",ucunrup5jjcfkl2yvxt7xzsq,8,2
8,"': ""glad i never watched any of these movies c...","': ""glad i never watched any of these movies c...",ucucdduck5ocaiys7uvaoc8w,9,1
9,"': 'thank you for this nice video❤', 'authordi...","': 'thank you for this nice video❤', 'authordi...",uc-tedxv4nyguzmuuyyzptdq,10,1


In [30]:
mapping_channelID(english_df,'channel')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DF.loc[:,'channel_ID'] = DF['channel'].map(mapping)


Unnamed: 0,Unnamed: 1,Lowercase_Text_No_Emoji,English_Text,channel,channel_ID
Sheet0,3,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,ucyyvw-yw9y_x34seiv1gqgw,1
Sheet0,4,"': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...",ucxmead5w1ptnx2ftncnif9q,2
Sheet0,5,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...,ucrbtci2rh4_idtcb_eafzgq,3
Sheet0,6,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....",uczhzeb6gw2mtsvcwu-6fwyq,4
Sheet0,7,"': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth...",ucrfn57im3avpoo3hvt69z9w,5
...,...,...,...,...,...
Sheet28,3101,"': 'сперва прикольно, а потом дура с котами вс...","': 'сперва прикольно, а потом дура с котами вс...",ucqezb1vfkiifwwygrdlonla,36098
Sheet28,3102,"': 'what did that cat say???', 'authordisplayn...","': 'what did that cat say???', 'authordisplayn...",ucmgtvcfy2bpumi3xhvidsyg,36099
Sheet28,3103,"': 'why are so many cats saying the n word?!',...","': 'why are so many cats saying the n word?!',...",uc2bmrqfs4lbqm3mtojmxlwa,36100
Sheet28,3104,"': 'this is confusing', 'authordisplayname': '...","': 'this is confusing', 'authordisplayname': '...",uc8ygpj9ghc3xuyxiccb7tfg,36101


In [75]:
mapping_channelID(df,'channel')


Unnamed: 0,Lowercase_Text_No_Emoji,English_Text,channel,channel_ID,Count
0,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,ucyyvw-yw9y_x34seiv1gqgw,1,1
1,"': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...",ucxmead5w1ptnx2ftncnif9q,2,1
2,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...,ucrbtci2rh4_idtcb_eafzgq,3,1
3,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....",uczhzeb6gw2mtsvcwu-6fwyq,4,1
4,"': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth...",ucrfn57im3avpoo3hvt69z9w,5,1
...,...,...,...,...,...
36783,"': 'сперва прикольно, а потом дура с котами вс...","': 'сперва прикольно, а потом дура с котами вс...",ucqezb1vfkiifwwygrdlonla,36098,1
36784,"': 'what did that cat say???', 'authordisplayn...","': 'what did that cat say???', 'authordisplayn...",ucmgtvcfy2bpumi3xhvidsyg,36099,1
36785,"': 'why are so many cats saying the n word?!',...","': 'why are so many cats saying the n word?!',...",uc2bmrqfs4lbqm3mtojmxlwa,36100,1
36786,"': 'this is confusing', 'authordisplayname': '...","': 'this is confusing', 'authordisplayname': '...",uc8ygpj9ghc3xuyxiccb7tfg,36101,1


In [42]:
def Channel_Count(DF):

  channel_counts = DF.groupby('channel_ID').size().reset_index(name='Count')

  DF = pd.merge(DF, channel_counts, on='channel_ID')

  return DF
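
The merge in `Channel_Count` returns a new dataframe with a fresh 0-based index. A hedged alternative that adds the count in place and keeps the original index is `groupby(...).transform('size')`, sketched here on a small made-up dataframe.

```python
import pandas as pd

df = pd.DataFrame({"channel_ID": [1, 2, 2, 3], "text": ["a", "b", "c", "d"]})

# transform('size') returns one value per row: the size of that row's group.
df["Count"] = df.groupby("channel_ID")["channel_ID"].transform("size")
print(df)
```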

In [53]:
english_df = Channel_Count(english_df)

In [80]:
df = Channel_Count(df)

In [81]:

df.head(20)


Unnamed: 0,Lowercase_Text_No_Emoji,English_Text,channel,channel_ID,Count
0,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,ucyyvw-yw9y_x34seiv1gqgw,1,1
1,"': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...",ucxmead5w1ptnx2ftncnif9q,2,1
2,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...,ucrbtci2rh4_idtcb_eafzgq,3,1
3,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....",uczhzeb6gw2mtsvcwu-6fwyq,4,1
4,"': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth...",ucrfn57im3avpoo3hvt69z9w,5,1
5,"': 'this is crazy awesome', 'authordisplayname...","': 'this is crazy awesome', 'authordisplayname...",ucsjfstsyg4uly5m6oitmajg,6,1
6,"': 'love this video', 'authordisplayname': '@u...","': 'love this video', 'authordisplayname': '@u...",ucunrup5jjcfkl2yvxt7xzsq,7,2
7,"': 'excellent editing', 'authordisplayname': '...","': 'excellent editing', 'authordisplayname': '...",ucunrup5jjcfkl2yvxt7xzsq,7,2
8,"': ""glad i never watched any of these movies c...","': ""glad i never watched any of these movies c...",ucucdduck5ocaiys7uvaoc8w,8,1
9,"': 'thank you for this nice video❤', 'authordi...","': 'thank you for this nice video❤', 'authordi...",uc-tedxv4nyguzmuuyyzptdq,9,1


I wasn't sure whether my code was wrong, so I used this check to see whether any channel had a count of 15 or greater. The answer was no, so I am assuming my data has no channel with 15 or more comments. I did not believe this at first, but the assignment description did say "only if you have 15 or more", so I think what I have is correct.

In [55]:
comments_15OR_More = english_df[english_df['Count'] >= 15]
if not comments_15OR_More.empty:
    print("There are channels with 15 or more comments")
else:
    print("no channel with 15 or more comments ")

no channel with 15 or more comments 


Step 3  Generate csv file

In [82]:
Final_df = pd.DataFrame()

Final_df['channel_id'] = english_df['channel_ID']
Final_df['eng_comment_count'] = english_df['Count']
Final_df['all_comment_count'] = df['Count']

In [83]:

Final_df.head(20)

Unnamed: 0,channel_id,eng_comment_count,all_comment_count
0,1,1,1
1,2,1,1
2,3,1,1
3,4,1,1
4,5,1,1
5,6,1,1
6,7,2,2
7,7,2,2
8,8,1,1
9,9,1,1
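
`Final_df` above has one row per comment, so a channel with two comments (channel 7 in the output) appears twice. If one row per channel is what is wanted, which is an assumption about the assignment, the counts could be computed per channel and then joined. The sketch below uses a made-up dataframe in which `English_Text` is missing for non-English comments.

```python
import pandas as pd

df = pd.DataFrame({
    "channel_ID": [1, 2, 2, 3],
    "English_Text": ["hi", "great", None, None],
})

# Count all comments and English-only comments per channel.
all_counts = df.groupby("channel_ID").size().rename("all_comment_count")
eng_counts = (
    df[df["English_Text"].notna()]
    .groupby("channel_ID").size()
    .rename("eng_comment_count")
)

# Join on the channel ID; channels without English comments get 0.
summary = (
    pd.concat([eng_counts, all_counts], axis=1)
    .fillna(0).astype(int)
    .rename_axis("channel_id")
    .reset_index()
)
print(summary)
```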


In [84]:
Final_df.to_csv('124_channel_list.csv', index= False)

In [86]:
files.download('124_channel_list.csv')


<div class="alert alert-block alert-warning">
    
### 4.1. Tokenization <a class="anchor" name="tokenize"></a>


Step 4 Generate the unigram and bigram lists and output as vocab.txt


In [225]:
df.head(20)

Unnamed: 0,Lowercase_Text_No_Emoji,English_Text,channel,channel_ID,Count,Comments
0,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,ucyyvw-yw9y_x34seiv1gqgw,1,1,i listened to this sitting on my cold dark por...
1,"': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...",ucxmead5w1ptnx2ftncnif9q,2,1,great!
2,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...,ucrbtci2rh4_idtcb_eafzgq,3,1,i saw online at who that during the ponzi sche...
3,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....",uczhzeb6gw2mtsvcwu-6fwyq,4,1,@aminesemlali6199
4,"': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth...",ucrfn57im3avpoo3hvt69z9w,5,1,im blasting this when im at school
5,"': 'this is crazy awesome', 'authordisplayname...","': 'this is crazy awesome', 'authordisplayname...",ucsjfstsyg4uly5m6oitmajg,6,1,this is crazy awesome
6,"': 'love this video', 'authordisplayname': '@u...","': 'love this video', 'authordisplayname': '@u...",ucunrup5jjcfkl2yvxt7xzsq,7,2,love this video
7,"': 'excellent editing', 'authordisplayname': '...","': 'excellent editing', 'authordisplayname': '...",ucunrup5jjcfkl2yvxt7xzsq,7,2,excellent editing
8,"': ""glad i never watched any of these movies c...","': ""glad i never watched any of these movies c...",ucucdduck5ocaiys7uvaoc8w,8,1,@theclipman9283
9,"': 'thank you for this nice video❤', 'authordi...","': 'thank you for this nice video❤', 'authordi...",uc-tedxv4nyguzmuuyyzptdq,9,1,thank you for this nice video❤


In [226]:
#extract only the comment text from each row, to then tokenize
pattern = r": '(.*?)',"
df['Comments'] = df['English_Text'].str.extract(pattern)
df.head(20)

Unnamed: 0,Lowercase_Text_No_Emoji,English_Text,channel,channel_ID,Count,Comments
0,': 'i listened to this sitting on my cold dark...,': 'i listened to this sitting on my cold dark...,ucyyvw-yw9y_x34seiv1gqgw,1,1,i listened to this sitting on my cold dark por...
1,"': 'great!', 'authordisplayname': '@ronaldslie...","': 'great!', 'authordisplayname': '@ronaldslie...",ucxmead5w1ptnx2ftncnif9q,2,1,great!
2,': 'i saw online at who that during the ponzi ...,': 'i saw online at who that during the ponzi ...,ucrbtci2rh4_idtcb_eafzgq,3,1,i saw online at who that during the ponzi sche...
3,"': ""his voice sound like it's made out of soy....","': ""his voice sound like it's made out of soy....",uczhzeb6gw2mtsvcwu-6fwyq,4,1,@aminesemlali6199
4,"': 'im blasting this when im at school', 'auth...","': 'im blasting this when im at school', 'auth...",ucrfn57im3avpoo3hvt69z9w,5,1,im blasting this when im at school
5,"': 'this is crazy awesome', 'authordisplayname...","': 'this is crazy awesome', 'authordisplayname...",ucsjfstsyg4uly5m6oitmajg,6,1,this is crazy awesome
6,"': 'love this video', 'authordisplayname': '@u...","': 'love this video', 'authordisplayname': '@u...",ucunrup5jjcfkl2yvxt7xzsq,7,2,love this video
7,"': 'excellent editing', 'authordisplayname': '...","': 'excellent editing', 'authordisplayname': '...",ucunrup5jjcfkl2yvxt7xzsq,7,2,excellent editing
8,"': ""glad i never watched any of these movies c...","': ""glad i never watched any of these movies c...",ucucdduck5ocaiys7uvaoc8w,8,1,@theclipman9283
9,"': 'thank you for this nice video❤', 'authordi...","': 'thank you for this nice video❤', 'authordi...",uc-tedxv4nyguzmuuyyzptdq,9,1,thank you for this nice video❤
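
In the output above, rows 3 and 8 contain a display name (`@aminesemlali6199`, `@theclipman9283`) rather than the comment. The pattern `: '(.*?)',` expects a single-quoted value, but comments containing an apostrophe are wrapped in double quotes, so the pattern skips them and captures the next single-quoted value instead. The sketch below accepts either quote character. It assumes the `'authordisplayname'` key always follows the comment text directly, as it does in the rows shown; the helper name `extract_comment` is illustrative.

```python
import re

# Group 1 is the opening quote (' or "), group 2 is the comment text,
# and \1 requires the same quote character to close it.
COMMENT_RE = re.compile(r""":\s(['"])(.*?)\1,\s'authordisplayname'""", re.DOTALL)

def extract_comment(raw):
    """Return the comment text from a raw extracted string, or None."""
    if not isinstance(raw, str):
        return None
    match = COMMENT_RE.search(raw)
    return match.group(2) if match else None

single = "': 'great!', 'authordisplayname': '@ronaldslie'"
double = ("': \"his voice sound like it's made out of soy.\", "
          "'authordisplayname': '@aminesemlali6199'")
print(extract_comment(single))
print(extract_comment(double))
```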


In [227]:
token_df = pd.DataFrame()
token_df['Token'] = df['Comments'].str.split()
token_df = token_df.explode('Token')

In [228]:
token_df.shape


(415883, 1)

In [229]:
# open the stop-word file in read mode; each line is one stop word
with open("/content/drive/Shared drives/FIT5196_S1_2024/A1/stopwords_en.txt", "r", encoding="utf-8") as file:
    stop_words = [line.strip() for line in file]

In [230]:
token_df= token_df[~token_df['Token'].isin(stop_words)]

In [231]:
#started with 415883 tokens and now only 214611 remain, meaning the stop words were removed
token_df.shape

(214611, 1)

Here again I wasn't sure whether the output tokens were what they were supposed to be, but from what I can see the code is doing what it is supposed to, because:


*   it started off with 214611 rows and went down to 198717 rows, meaning the code did remove some tokens
*   however, when I went through the result it still had some special characters, even though I used the r"[a-zA-Z]+" pattern from the assignment, so I am not too sure.



In [315]:
pattern = r"[a-zA-Z]+"

token_df = token_df.dropna(subset =['Token'])
token_df= token_df[token_df['Token'].str.contains(pattern)]

token_df.shape

(198717, 2)
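
A likely reason special characters remain is that `str.contains(pattern)` keeps every token that contains at least one letter, so tokens such as `album.` or `@letodk` pass through unchanged. Using the pattern as the tokenizer itself returns only the letter runs. The sketch below does this with `re.findall` so it runs without extra libraries; nltk's `RegexpTokenizer`, imported earlier, should behave the same way when given this pattern.

```python
import re

TOKEN_RE = re.compile(r"[a-zA-Z]+")

def tokenize(comment):
    """Return the runs of letters in a comment; non-strings give an empty list."""
    return TOKEN_RE.findall(comment) if isinstance(comment, str) else []

tokens = tokenize("limitations. album. @letodk stenography?")
print(tokens)
```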

In [236]:
token_count = token_df.groupby('Token').size().reset_index(name='Token_Count')

token_df = pd.merge(token_df, token_count, on='Token')



In [270]:

token_df.shape

(198717, 2)

In [246]:
unique_tokens_counts =[(token, count) for token, count in zip(token_df['Token'], token_df['Token_Count'])]

unique_tokens_counts = list(set(unique_tokens_counts))

In [300]:
token_df_grouped = pd.DataFrame(unique_tokens_counts, columns=['Token','Count'])

In [301]:
token_df_grouped.head(20)

Unnamed: 0,Token,Count
0,lentils,1
1,\nun,1
2,emotionen,1
3,morte,3
4,stenography?,1
5,comprehensive,8
6,@letodk,1
7,limitations.,1
8,album.,1
9,dread,1
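
Step 4 asks for unigram and bigram lists, while the cells above count single tokens only. As a minimal sketch of how both could be counted, the code below takes comments that are already tokenized and uses `collections.Counter`. The function name `count_ngrams` and the sample tokens are made up, and any bigram selection rule the assignment requires is not covered here.

```python
from collections import Counter

def count_ngrams(tokenized_comments):
    """Count unigrams and bigrams over a list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in tokenized_comments:
        unigrams.update(tokens)
        # zip pairs each token with its successor inside the same comment,
        # so no bigram spans two different comments.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

unigrams, bigrams = count_ngrams([
    ["love", "this", "video"],
    ["love", "this", "editing"],
])
print(unigrams.most_common(2))
print(bigrams.most_common(1))
```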


<div class="alert alert-block alert-warning">
    
### 5.1. Vocabulary List <a class="anchor" name="write-vocab"></a>

In [302]:
porter_stemmer = PorterStemmer()

token_df_grouped['Tokens'] = token_df_grouped['Token'].apply(lambda token: porter_stemmer.stem(token))

In [303]:
token_df_grouped = token_df_grouped[token_df_grouped['Token'].apply(len) >=3]

In [312]:

sorted_words = token_df_grouped['Token'].tolist()
sorted_words.sort()

In [313]:
print(sorted_words)



<div class="alert alert-block alert-success">
    
## 5. Writing Output Files <a class="anchor" name="write"></a>


In [323]:
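# write one vocabulary word per line; utf-8 so that non-ASCII tokens are written safely
with open('124_vocab.txt', 'w', encoding='utf-8') as file:
  for word in sorted_words:
    file.write(word +'\n')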

In [324]:
files.download('124_vocab.txt')


<div class="alert alert-block alert-warning">
    
### 5.2. Sparse Matrix <a class="anchor" name="write-sparseMat"></a>


<div class="alert alert-block alert-success">
    
## 6. Summary <a class="anchor" name="summary"></a>
In conclusion, I learnt a lot in this assignment, especially about dataframes. Although I could not finish, I think that with more time I would have been able to.

The difficult steps were steps 2 and 4. I was stuck on step 2 for two days because the processing was very slow, which I could not fix in the end, but at least it was working.

Step 4 was more difficult because I have never worked with unigrams or bigrams before.

<div class="alert alert-block alert-success">
    
## 7. References <a class="anchor" name="Ref"></a>

pandas documentation (pandas 2.2.2). Available at: https://pandas.pydata.org/docs/ (Accessed: 8-21 April 2024).

Monash University, Unit FIT5196, Lab 1 through Lab 5 (Accessed: 8-21 April 2024).

YouTube. Available at: https://www.youtube.com/ (Accessed: 8-21 April 2024).