<div class="alert alert-block alert-danger">

# FIT5196 Task 2 in Assessment 1
    
#### Student Name: Anish S, Michael Bradtke
#### Student ID: 34113339, 32492464

Date: 28/03/2024

Environment: Python

Libraries used:
* pandas (for extracting data from excel and handling it)
* nltk (for text processing the data - tokenization, stemming, finding n-grams etc.)
* collections (for aid in counting)
* re (to simplify and aid pattern searching)
* langdetect (for detecting the language used in comments/snippets)
* sklearn (for counting unigram and bigram token-frequencies in english comments)


**TASK 2 PART DONE BY ANISH S**
    </div>

<div class="alert alert-block alert-info">
    
## Table of Contents

</div>

[1. Introduction](#Intro) <br>
[2. Importing Libraries](#libs) <br>
[3. Examining Input File & Loading Data](#step1) <br>
[4. Text Extraction, Cleaning and Processing](#step2&3) <br>
[5. Generating Vocab](#step4) <br>
$\;\;\;\;$[5.1. Vocabulary List](#write-vocab) <br>
[6. Sparse Matrix & CountVec](#step6) <br>
[7. Summary](#summary) <br>
[8. References](#Ref) <br>

<div class="alert alert-block alert-success">
    
## Introduction  <a class="anchor" name="Intro"></a>

This assessment concerns textual data: the aim is to extract the data, process it, and transform it into a suitable format.

The dataset provided is an Excel file with 30 sheets. Each sheet contains a table with two columns, 'id' and 'snippet'.

The objective is to extract the *English* comments from the ***textOriginal*** field of the 'snippet' column, keep only the channels (identified by *channelId*) that have at least 15 English comments, and then generate a vocabulary and token count frequencies from these comments.

Three output files are required:
* a .csv file containing (*channel_id, all_comment_count, eng_comment_count*)
* vocab.txt, with the vocabulary of tokens (unigrams and bigrams) in the format (*token : token_index*)
* countvec.txt, with the count frequency of each vocabulary token in the English comments of each channel, in the format (*channelId, token1:token1_frq, token2:token2_frq, .....*)
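
To make the two text formats concrete, here is a minimal sketch that builds one vocab.txt line and one countvec.txt line. The helper names, the channel id, the tokens and the counts are all invented for illustration. The exact spacing around ':' and ',' and the underscore joining a bigram are assumptions, not part of the specification above.

```python
# Hypothetical helpers illustrating the two text formats described above
def vocab_line(token, index):
    # vocab.txt: one "token:token_index" pair per line
    return f"{token}:{index}"

def countvec_line(channel_id, token_counts):
    # countvec.txt: channelId followed by comma-separated "token:frequency" pairs
    pairs = ",".join(f"{token}:{count}" for token, count in token_counts.items())
    return f"{channel_id},{pairs}"

print(vocab_line("love", 0))
print(countvec_line("UC_example_channel", {"love": 3, "good_song": 1}))
```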

<div class="alert alert-block alert-success">
    
## Importing Libraries  <a class="anchor" name="libs"></a>

In this assessment, any Python package may be used. The following packages were used to accomplish the related tasks:

* **os:** to interact with the operating system, e.g. navigate through folders to read files - *to be used by evaluators at their discretion*
* **re:** to define and use regular expressions, which speeds up pattern matching.
* **pandas:** to work with dataframes; to extract data into them and handle it.
* **langdetect:** to detect the language of the raw text.
* **collections:** to simplify counting; counters are set to default values during initialisation.
* **nltk:** to process raw textual data: tokenisation, stemming, finding n-grams, etc.
* **sklearn:** to count token frequencies using CountVectorizer.

RUN THIS CELL ONCE AT THE START 👇

In [1]:
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993227 sha256=2f702ffc5d8e6f524402632f1a3f5b3de918f4b9859b86d229d50f9616964ebf
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [2]:
import os # have not used this but left for evaluators to make use of if needed during file upload and access
import re
import pandas as pd
from langdetect import DetectorFactory, detect
from collections import defaultdict, Counter
import csv
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import MWETokenizer
import numpy as np
from nltk.util import ngrams
from nltk.probability import *
from sklearn.feature_extraction.text import CountVectorizer

-------------------------------------

<div class="alert alert-block alert-success">
    
## STEP 1 : Examining Input File & Loading Data into right format <a class="anchor" name="step1"></a>

MOUNTING MY DRIVE

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Read the Excel file
excel_file = pd.ExcelFile('/content/drive/Shareddrives/FIT5196_S1_2024/A1/Students data/Task 2/Group125.xlsx')
excel_file

<pandas.io.excel._base.ExcelFile at 0x7e58d55453c0>

### Checking the structure of the data in the input file

In [5]:
# show the names of all the sheets
excel_file.sheet_names

['Sheet0',
 'Sheet1',
 'Sheet2',
 'Sheet3',
 'Sheet4',
 'Sheet5',
 'Sheet6',
 'Sheet7',
 'Sheet8',
 'Sheet9',
 'Sheet10',
 'Sheet11',
 'Sheet12',
 'Sheet13',
 'Sheet14',
 'Sheet15',
 'Sheet16',
 'Sheet17',
 'Sheet18',
 'Sheet19',
 'Sheet20',
 'Sheet21',
 'Sheet22',
 'Sheet23',
 'Sheet24',
 'Sheet25',
 'Sheet26',
 'Sheet27',
 'Sheet28',
 'Sheet29']

In [6]:
check_df = excel_file.parse('Sheet1')
check_df.head(103)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,,,,,
1,,,,,
2,,,,id,snippet
3,,,,UgwIbRJulYHkv_aa-pR4AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."
4,,,,UgzATskLkV73PwWOf2V4AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."
...,...,...,...,...,...
98,,,,UgzW48vmgX71LDpBFXh4AaABAg,"{'channelId': 'UCq-Fj5jknLsUf-MWSy4_brA', 'vid..."
99,,,,UgzoEeOGMSPDWkr07BN4AaABAg,"{'channelId': 'UCnUqX0KIAu3svfrlkrYaZew', 'vid..."
100,,,,UgxlOD6gr0ph-kqdV4t4AaABAg,"{'channelId': 'UCnUqX0KIAu3svfrlkrYaZew', 'vid..."
101,,,,Ugzrr7aDXmv5IS6kldh4AaABAg,"{'channelId': 'UCv6T1ljYi3CETJ1Jqj1bFMQ', 'vid..."


### Extracting the data in sheets to pandas dataframe

In [7]:
# Dictionary to store DataFrames for each sheet
dfs = {}

# Iterate through each sheet
for sheet_num in range(30):
    # Read the sheet
    sheet_name = f'Sheet{sheet_num}'
    df = pd.read_excel(excel_file, sheet_name=sheet_name)

    # Find the starting row and column of the table
    start_row, start_col = 0, 0
    for i in range(df.shape[0]):
        if not df.iloc[i].isnull().all():
            start_row = i
            break
    for j in range(df.shape[1]):
        if not df.iloc[:, j].isnull().all():
            start_col = j
            break

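    # Re-read without a header so every spreadsheet row is kept as data, then crop to the table
    df = pd.read_excel(excel_file, sheet_name=sheet_name, header=None)
    df = df.iloc[start_row:, start_col:]

    # start_row was found on a frame whose first spreadsheet row had been consumed as the header,
    # so the crop starts one row above the 'id'/'snippet' header row: drop both of those rows
    df = df.iloc[2:]
    df.columns = ['id', 'snippet']  # Set column names to 'id' and 'snippet'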

    # Reset index
    df.reset_index(drop=True, inplace=True)

    # Save the DataFrame to the dictionary
    dfs[sheet_name] = df

# Now dfs contains DataFrames for each sheet
# You can access them using dfs['Sheet0'], dfs['Sheet1'], etc.
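
The cell above finds the table by scanning for the first non-empty row and the first non-empty column. The same idea is sketched below on a plain list-of-lists grid, used here as a stand-in for a sheet read with `header=None`. The grid contents and the `locate_table` name are invented for illustration; this is not the code the notebook runs.

```python
def locate_table(grid):
    """Return (row, col) of the top-left corner of the table in a rectangular grid."""
    # First row that contains at least one non-empty cell
    first_row = next(i for i, row in enumerate(grid) if any(cell is not None for cell in row))
    # First column that contains at least one non-empty cell
    first_col = next(j for j in range(len(grid[0])) if any(row[j] is not None for row in grid))
    return first_row, first_col

# Toy sheet: blank rows and columns surround a two-column table
grid = [
    [None, None, None, None],
    [None, None, None, None],
    [None, None, "id", "snippet"],
    [None, None, "c1", "{'channelId': 'UC_a'}"],
    [None, None, "c2", "{'channelId': 'UC_b'}"],
]

row, col = locate_table(grid)
header = grid[row][col:]
records = [dict(zip(header, r[col:])) for r in grid[row + 1:]]
print(header)
print(records[0])
```

Because the header row is located directly, no fixed row offset is needed after cropping.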


In [8]:
dfs['Sheet1'].head(100)

Unnamed: 0,id,snippet
0,UgwIbRJulYHkv_aa-pR4AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."
1,UgzATskLkV73PwWOf2V4AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."
2,UgxPNsmNZadEw4xfoSh4AaABAg,"{'channelId': 'UCsT0YIqwnpJCM-mx7-gSA4Q', 'vid..."
3,UgwozwB4-ldSkYrwIJ14AaABAg,"{'channelId': 'UCNye-wNBqNL5ZzHSJj3l8Bg', 'vid..."
4,UgxI8Fi8Hne5ocMiGVt4AaABAg,"{'channelId': 'UC9x0AN7BWHpCDHSm9NiJFJQ', 'vid..."
...,...,...
95,UgzW48vmgX71LDpBFXh4AaABAg,"{'channelId': 'UCq-Fj5jknLsUf-MWSy4_brA', 'vid..."
96,UgzoEeOGMSPDWkr07BN4AaABAg,"{'channelId': 'UCnUqX0KIAu3svfrlkrYaZew', 'vid..."
97,UgxlOD6gr0ph-kqdV4t4AaABAg,"{'channelId': 'UCnUqX0KIAu3svfrlkrYaZew', 'vid..."
98,Ugzrr7aDXmv5IS6kldh4AaABAg,"{'channelId': 'UCv6T1ljYi3CETJ1Jqj1bFMQ', 'vid..."


In [9]:
dfs['Sheet29']

Unnamed: 0,id,snippet
0,Ugw89KqESLnH70IgTRp4AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."
1,Ugyy5LwulTB9JkRDKB54AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."
2,UgwmbtiVQvmIM9sC6sJ4AaABAg,"{'channelId': 'UCrUTphxarZzISDiXFsFnUUA', 'vid..."
3,UgxwE5ZjO500mfXthxJ4AaABAg,"{'channelId': 'UCNye-wNBqNL5ZzHSJj3l8Bg', 'vid..."
4,Ugym1RVuEe0EKWK1pKV4AaABAg,"{'channelId': 'UCNye-wNBqNL5ZzHSJj3l8Bg', 'vid..."
...,...,...
3463,Ugwu8EmB5HY7roDrJZp4AaABAg,"{'channelId': 'UCTZmwwn0Wd5YsAxw0NZzyJA', 'vid..."
3464,Ugywo6qSH8beAcXpRc14AaABAg,"{'channelId': 'UCTZmwwn0Wd5YsAxw0NZzyJA', 'vid..."
3465,Ugz-JVqisRoYMaPBJaZ4AaABAg,"{'channelId': 'UC_DtZXGLt74sv8EV1Covp9w', 'vid..."
3466,UgweA9YwsJJW7mmPTgJ4AaABAg,"{'channelId': 'UC_DtZXGLt74sv8EV1Covp9w', 'vid..."


How many rows of data are there in total, before duplicates are dropped?

In [10]:
total_with_dup = 0
for sheet_name, df in dfs.items():
    print(f"Number of rows in {sheet_name}: {len(df)}")
    total_with_dup += len(df)

print(f"TOTAL ROWS : {total_with_dup}")

Number of rows in Sheet0: 3424
Number of rows in Sheet1: 3222
Number of rows in Sheet2: 3035
Number of rows in Sheet3: 3100
Number of rows in Sheet4: 3295
Number of rows in Sheet5: 3371
Number of rows in Sheet6: 3208
Number of rows in Sheet7: 3003
Number of rows in Sheet8: 3291
Number of rows in Sheet9: 3024
Number of rows in Sheet10: 3146
Number of rows in Sheet11: 3005
Number of rows in Sheet12: 3144
Number of rows in Sheet13: 3028
Number of rows in Sheet14: 3200
Number of rows in Sheet15: 3354
Number of rows in Sheet16: 3350
Number of rows in Sheet17: 3354
Number of rows in Sheet18: 3320
Number of rows in Sheet19: 3417
Number of rows in Sheet20: 3175
Number of rows in Sheet21: 3309
Number of rows in Sheet22: 3398
Number of rows in Sheet23: 3060
Number of rows in Sheet24: 3171
Number of rows in Sheet25: 3194
Number of rows in Sheet26: 3300
Number of rows in Sheet27: 3053
Number of rows in Sheet28: 3278
Number of rows in Sheet29: 3468
TOTAL ROWS : 96697


### Combining Data from all sheets and dropping duplicates

In [11]:
# Combine all DataFrames into one
combined_df = pd.concat(dfs.values(), ignore_index=True)

# Remove duplicates
combined_df.drop_duplicates(inplace=True)

# Reset index
combined_df.reset_index(drop=True, inplace=True)

print(len(combined_df))

combined_df

81596


Unnamed: 0,id,snippet
0,UgwW0S-R3kDpXVZAvSB4AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."
1,UgytTFIEqvnNoGkuFw94AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."
2,Ugx9vT8IzSlbjuVcSmF4AaABAg,"{'channelId': 'UCet0ZrYmw-V_hsGPb7KsiOQ', 'vid..."
3,UgwbhszDzd-9xc3Ybd14AaABAg,"{'channelId': 'UCsT0YIqwnpJCM-mx7-gSA4Q', 'vid..."
4,UgzO4jz323bfVovOanh4AaABAg,"{'channelId': 'UCsT0YIqwnpJCM-mx7-gSA4Q', 'vid..."
...,...,...
81591,Ugx66p2O4191h8CtCed4AaABAg,"{'channelId': 'UCa_3ly2XPt-y3tkkEtGfpCw', 'vid..."
81592,Ugy-SHpGs19zgmz1nOZ4AaABAg,"{'channelId': 'UCJgDzk8RDB_gSsn0ZGxLODA', 'vid..."
81593,Ugwu8EmB5HY7roDrJZp4AaABAg,"{'channelId': 'UCTZmwwn0Wd5YsAxw0NZzyJA', 'vid..."
81594,Ugywo6qSH8beAcXpRc14AaABAg,"{'channelId': 'UCTZmwwn0Wd5YsAxw0NZzyJA', 'vid..."


In [12]:
##### Step 1 COMPLETE

<div class="alert alert-block alert-success">
    
## STEP 2 & 3 : Text Extraction, Cleaning and Processing <a class="anchor" name="step2&3"></a>

### STEP 2: Emoji Cleaning

In [13]:
# Set the seed for langdetect
DetectorFactory.seed = 0

Extracting the list of emojis to remove from emoji.txt ▶

In [14]:
# Load emoji list
with open('/content/drive/Shareddrives/FIT5196_S1_2024/A1/emoji.txt', 'r', encoding='utf-8') as file:
    emoji_list = file.read().splitlines()

print(emoji_list)

['🥇', '🥈', '🥉', '🆎', '🏧', '🅰️', ... (output truncated)

In [15]:
# Function to remove emojis from text
def remove_emojis(text, emoji_list):
    emoji_pattern = "|".join(map(re.escape, emoji_list))
    return re.sub(emoji_pattern, '', text)
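
`remove_emojis` rebuilds the alternation pattern on every call, and a regex alternation tries its branches in the order given. The sketch below compiles the pattern once and puts longer entries first, so that a multi-code-point emoji (such as a flag) is removed whole rather than losing only its first code point. The `build_emoji_pattern` name and the three-entry sample list are assumptions made for illustration; whether emoji.txt actually contains such overlapping entries is not checked here.

```python
import re

def build_emoji_pattern(emoji_list):
    # Longest entries first, so a multi-code-point emoji wins over any shorter entry it starts with
    ordered = sorted(emoji_list, key=len, reverse=True)
    return re.compile("|".join(map(re.escape, ordered)))

# Tiny illustrative list: a medal, regional indicator A, and the two-code-point flag that starts with it
sample_emojis = ["\U0001F947", "\U0001F1E6", "\U0001F1E6\U0001F1FA"]
pattern = build_emoji_pattern(sample_emojis)

text = "gold \U0001F947 for \U0001F1E6\U0001F1FA!"
cleaned = pattern.sub("", text)

# Same list in its original order: the shorter entry matches first and strands the second code point
naive = re.sub("|".join(map(re.escape, sample_emojis)), "", text)

print(repr(cleaned))
print(repr(naive))
```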

#### Cleaning Emojis and extracting english comments and respective channelIds

**LOGIC:**
(for each 'snippet' row)

↪ extract the *textOriginal* field from the snippet.

$\;\;\;\;$JSON DATA STRUCTURE FORMAT -

$\;\;\;\;${......*'topLevelComment'* : {..... *'snippet'* : {...... *'channelId'*: {**DESIRED_CHANNEL**}, ..... *'textOriginal'*: {**DESIRED_COMMENT**}}}}

↪ extract the corresponding *channelId* field from the same nested snippet, as above. 👆

↪ remove the emojis identified in emoji_list.

↪ keep only the comments detected as English.

↪ add the kept comments and their channelIds to a new dataframe.
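
The extraction step can be sketched on a single made-up snippet string. Only the nesting (`topLevelComment` → `snippet` → `channelId` / `textOriginal`) follows the structure described above; the ids, the text and the `extract_comment` helper are invented, and real rows carry more fields. `ast.literal_eval` is used as an assumption-free way to parse a dict written as a Python literal.

```python
import ast

# Made-up snippet string with the nesting described above
raw_snippet = (
    "{'channelId': 'UC_example', 'videoId': 'abc123', "
    "'topLevelComment': {'id': 'Ug_example', "
    "'snippet': {'channelId': 'UC_example', 'textOriginal': 'Great video!'}}}"
)

def extract_comment(raw):
    """Return (channelId, textOriginal) from the nested topLevelComment snippet."""
    snippet = ast.literal_eval(raw)  # parses Python literals only, never executes code
    inner = snippet.get('topLevelComment', {}).get('snippet', {})
    return inner.get('channelId'), inner.get('textOriginal', '')

print(extract_comment(raw_snippet))
```

The chained `.get(..., {})` calls mean a row without a `topLevelComment` yields `(None, '')` instead of raising a `KeyError`.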

In [16]:
# emoji - cleaning
import ast

# List to store channelId and English comments
english_comments_list = []

# Dictionary to store 'all_count' and 'eng_count' for each channel
channel_info = {}

# Process each channel's comments
for index, row in combined_df.iterrows():
    # Parse the snippet string into a dict; ast.literal_eval accepts only Python literals,
    # so, unlike eval, it cannot execute code embedded in the data
    snippet = ast.literal_eval(row['snippet'])
    top_level_comments = snippet.get('topLevelComment', {}).get('snippet', {})
    text_original = top_level_comments.get('textOriginal', '')

    # Extract channelId from the same nested snippet
    channel_id = top_level_comments.get('channelId', None)

    # Update channel_info dictionary
    if channel_id not in channel_info.keys():
        channel_info[channel_id] = {'all_count': 0, 'eng_count': 0}
    channel_info[channel_id]['all_count'] += 1

    # Remove emojis
    cleaned_comment = remove_emojis(text_original, emoji_list)

    # Language detection
    try:
        lang = detect(cleaned_comment)
        if lang == 'en':
            english_comments_list.append({'channelId': channel_id, 'comment': cleaned_comment.lower()})
            channel_info[channel_id]['eng_count'] += 1
    except Exception:
        # detect() raises when the text has no usable features (e.g. empty after emoji removal); skip it
        pass

# Create DataFrame for English comments
english_comments_df = pd.DataFrame(english_comments_list)

english_comments_df.head(100)

Unnamed: 0,channelId,comment
0,UCet0ZrYmw-V_hsGPb7KsiOQ,"just stop pretendiiiiing u feel the same, if n..."
1,UCet0ZrYmw-V_hsGPb7KsiOQ,truly terrible.
2,UCet0ZrYmw-V_hsGPb7KsiOQ,in a dark candle lit hazy smoke filled basemen...
3,UCsT0YIqwnpJCM-mx7-gSA4Q,this is so great! thank you!
4,UCsT0YIqwnpJCM-mx7-gSA4Q,jerry seinfeld looks different
...,...,...
95,UCLjwRB4r2aZOAo-ftaWvP5A,we going to texas with this one
96,UCzzFtVXMzxD08Zej-qokESw,i love this song
97,UCUQw5JQgxAXcX1SvixOIH0w,it’s not real
98,UC1yNl2E66ZzKApQdRuTQ4tw,"all good, but it still can't get right the tim..."


In [17]:
english_comments_df

Unnamed: 0,channelId,comment
0,UCet0ZrYmw-V_hsGPb7KsiOQ,"just stop pretendiiiiing u feel the same, if n..."
1,UCet0ZrYmw-V_hsGPb7KsiOQ,truly terrible.
2,UCet0ZrYmw-V_hsGPb7KsiOQ,in a dark candle lit hazy smoke filled basemen...
3,UCsT0YIqwnpJCM-mx7-gSA4Q,this is so great! thank you!
4,UCsT0YIqwnpJCM-mx7-gSA4Q,jerry seinfeld looks different
...,...,...
55437,UCa_3ly2XPt-y3tkkEtGfpCw,2 months left to end 2023 & its still a master...
55438,UCJgDzk8RDB_gSsn0ZGxLODA,like i am going to say no to shrimp . thank y...
55439,UCTZmwwn0Wd5YsAxw0NZzyJA,"well, we could destroy nestle. peter brabeck-..."
55440,UCTZmwwn0Wd5YsAxw0NZzyJA,why are they fitting each other


NOTE:
- About 55k English comments were extracted from the input file and cleaned of emojis

---

<br>
<br>

**✔ CHECKPOINT #1 (for checking data)**

In [18]:
# Specify the path where you want to save the .txt file
file_path = 'check.txt'

# Write the DataFrame to a .txt file
english_comments_df.to_csv(file_path, sep='\t', index=False)

In [19]:
final_count = 0
for key, value in channel_info.items():
    if value['eng_count'] > 14:
      final_count += value['eng_count']
print("FINAL COUNT:", final_count)

FINAL COUNT: 51510


NOTE:
- There are 51,510 (~51k) English comments in total from channels with at least 15 English comments


---
<br>
<br>


### STEP 3 - Outputting to .csv format (**Output File #1**)

For each channel in the input file, we count the total number of comments and the number of English comments.

In [20]:
# Define the CSV file path
csv_file = '125_channel_list.csv'

# Define the headers for the CSV file
headers = ['channel_id', 'all_comment_count', 'eng_comment_count']

# Write dictionary to CSV file
with open(csv_file, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(headers)
    for channel_id, counts in channel_info.items():
        writer.writerow([channel_id, counts['all_count'], counts['eng_count']])
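
A quick way to sanity-check the CSV layout without touching the real output file is to write a few rows to an in-memory buffer and read them back. This is a sketch only: `channel_info_demo` and its counts are invented, and it mirrors the header names used in the cell above.

```python
import csv
import io

# Toy counts in the same shape as channel_info; the ids and numbers are invented
channel_info_demo = {
    'UC_a': {'all_count': 20, 'eng_count': 16},
    'UC_b': {'all_count': 5, 'eng_count': 2},
}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['channel_id', 'all_comment_count', 'eng_comment_count'])
for channel_id, counts in channel_info_demo.items():
    writer.writerow([channel_id, counts['all_count'], counts['eng_count']])

# Read the text back to confirm the header and row layout
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(rows[0])
```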

In [21]:
##### Step 2 & 3 COMPLETE!

<div class="alert alert-block alert-warning">
    
## STEP 4 : Generating Vocab - Unigram and Bigram Lists <a class="anchor" name="step4"></a>

### Generate Unigram Tokenlist

**LOGIC:**

↪ 1. Keep only the English comments made in channels that have at least 15 English comments.

↪ 2. Word-tokenize the English comments using RegexpTokenizer.

↪ 3. Remove context-independent stopwords from the word tokens.

↪ 4. Stem the resulting token list using the Porter Stemmer.

↪ 5. Remove context-dependent stopwords from the stemmed tokens. These are tokens that appear in more than 99% of the channels.

↪ 6. Remove rare tokens from the stemmed, stopword-free token lists. These are tokens that appear in less than 1% of the channels.

↪ 7. Remove tokens with size < 3 from the resulting token lists to generate the final unigram token list.



---



These are my **reasons** for choosing the above order of token processing:
1. I chose to remove context-independent stopwords before context-dependent ones so that words like 'is', 'are', etc., which are genuinely context-independent, are distinguished from words that are not in *stopwords_en.txt* but appear unusually frequently in the corpus of English comments (though I found none when I checked).
2. I chose to stem before removing context-dependent stopwords so that the inflected forms of a word are counted together under their stem, which makes the classification of context-dependent stopwords in the next step more accurate. Stemming beforehand also reduces the number of rare tokens, which leads to a better vocabulary.
3. I could have removed tokens with size < 3 before stemming, but that would mostly remove very common context-independent stopwords such as 'a', 'is', 'he', etc., which step 3 above already takes care of. Also, stemmed tokens of length less than 3 that remain after stopword removal rarely carry any meaning. This is why I apply this step last.
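
Reason 2 can be illustrated on a toy corpus: counting channels per token before and after stemming shows how inflected forms merge under one stem. The sketch below uses a small lookup table as a stand-in for PorterStemmer, covering only the toy tokens; the channel ids, tokens and the `channel_frequency` name are invented for illustration.

```python
from collections import defaultdict

def channel_frequency(channel_tokens):
    """Map each token to the number of distinct channels it appears in."""
    token_channels = defaultdict(set)
    for channel_id, tokens in channel_tokens:
        for token in tokens:
            token_channels[token].add(channel_id)
    return {token: len(channels) for token, channels in token_channels.items()}

# Stand-in for PorterStemmer: a lookup covering only the toy tokens below
toy_stem = {'songs': 'song', 'song': 'song', 'loved': 'love', 'love': 'love'}

demo = [
    ('UC_a', ['song', 'love']),
    ('UC_b', ['songs', 'loved']),
    ('UC_c', ['songs', 'love']),
]

raw_freq = channel_frequency(demo)
stemmed_freq = channel_frequency([(c, [toy_stem[t] for t in toks]) for c, toks in demo])
print(raw_freq)
print(stemmed_freq)
```

Before stemming, no single form appears in all three channels; after stemming, both stems do, so a channel-ratio threshold sees the word's true spread.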


In [22]:
# Step 1: Keep only channels with at least 15 English comments
filtered_channels = [channel_id for channel_id, counts in channel_info.items() if counts['eng_count'] > 14]
filtered_comments_df = english_comments_df[english_comments_df['channelId'].isin(filtered_channels)].reset_index(drop=True)

filtered_comments_df.head(100)

Unnamed: 0,channelId,comment
0,UCet0ZrYmw-V_hsGPb7KsiOQ,"just stop pretendiiiiing u feel the same, if n..."
1,UCet0ZrYmw-V_hsGPb7KsiOQ,truly terrible.
2,UCet0ZrYmw-V_hsGPb7KsiOQ,in a dark candle lit hazy smoke filled basemen...
3,UCsT0YIqwnpJCM-mx7-gSA4Q,this is so great! thank you!
4,UCsT0YIqwnpJCM-mx7-gSA4Q,jerry seinfeld looks different
...,...,...
95,UCMSYZVlQmyG8_2MkIKzg0kw,i did the samething as her and imo ucla is eas...
96,UCMSYZVlQmyG8_2MkIKzg0kw,and yet shes one of the densest people on the ...
97,UCNqFDjYTexJDET3rPDrmJKg,he is a natural man he has so many good songs
98,UCq-Fj5jknLsUf-MWSy4_brA,ye song toh movie mein hona chahiye tha hop...


In [23]:
print(filtered_comments_df['channelId'].nunique())

1315


NOTE:
- There are 1315 unique channels that have at least 15 English comments


---



In [24]:
# Step 2: Word tokenize using RegexpTokenizer
tokenizer = RegexpTokenizer(r"[a-zA-Z]+")
filtered_comments_df['tokens'] = filtered_comments_df['comment'].apply(lambda x: tokenizer.tokenize(x.lower()))

filtered_comments_df.head(100)

Unnamed: 0,channelId,comment,tokens
0,UCet0ZrYmw-V_hsGPb7KsiOQ,"just stop pretendiiiiing u feel the same, if n...","[just, stop, pretendiiiiing, u, feel, the, sam..."
1,UCet0ZrYmw-V_hsGPb7KsiOQ,truly terrible.,"[truly, terrible]"
2,UCet0ZrYmw-V_hsGPb7KsiOQ,in a dark candle lit hazy smoke filled basemen...,"[in, a, dark, candle, lit, hazy, smoke, filled..."
3,UCsT0YIqwnpJCM-mx7-gSA4Q,this is so great! thank you!,"[this, is, so, great, thank, you]"
4,UCsT0YIqwnpJCM-mx7-gSA4Q,jerry seinfeld looks different,"[jerry, seinfeld, looks, different]"
...,...,...,...
95,UCMSYZVlQmyG8_2MkIKzg0kw,i did the samething as her and imo ucla is eas...,"[i, did, the, samething, as, her, and, imo, uc..."
96,UCMSYZVlQmyG8_2MkIKzg0kw,and yet shes one of the densest people on the ...,"[and, yet, shes, one, of, the, densest, people..."
97,UCNqFDjYTexJDET3rPDrmJKg,he is a natural man he has so many good songs,"[he, is, a, natural, man, he, has, so, many, g..."
98,UCq-Fj5jknLsUf-MWSy4_brA,ye song toh movie mein hona chahiye tha hop...,"[ye, song, toh, movie, mein, hona, chahiye, th..."
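
The pattern `[a-zA-Z]+` keeps only runs of letters, so apostrophes split contractions and digits are dropped. A quick check with `re.findall`, which is assumed here to mirror what RegexpTokenizer does in its default (non-gaps) mode; the two sample sentences are invented:

```python
import re

pattern = re.compile(r"[a-zA-Z]+")

samples = ["i don't think it's real", "2 months left to end 2023"]
tokenized = [pattern.findall(text) for text in samples]
for tokens in tokenized:
    print(tokens)
```

This is consistent with fragments such as 'don', 've' and 'll' showing up as tokens in the channel-frequency counts printed later in this step.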


**✔ CHECKPOINT #2 (to check data generated)**

In [25]:
file_df = 'check2.txt'

filtered_comments_df.to_csv(file_df, sep='\t', index=False)

***Function to count no. of channels the token appears in***

In [26]:
# For removing context-dependent stopwords and rare tokens

def TTF(df_column):
    """Return a Counter mapping each token in df_column to the number of distinct channels it appears in."""
    # Map each token to the set of channels it appears in
    token_channels = defaultdict(set)
    for channel_id, tokens in zip(filtered_comments_df['channelId'], filtered_comments_df[df_column]):
        for token in tokens:
            token_channels[token].add(channel_id)

    # The channel frequency of a token is the size of its channel set
    return Counter({token: len(channels) for token, channels in token_channels.items()})

# Total number of channels: the denominator for the 99% and 1% thresholds
total_channels = len(filtered_channels)

In [27]:
# Step 3: Remove Context-independent stopwords
with open('/content/drive/Shareddrives/FIT5196_S1_2024/A1/stopwords_en.txt', 'r') as f:
    context_independent_stopwords = set(f.read().splitlines())

filtered_comments_df['cntInd-stopwords_removed_tokens'] = filtered_comments_df['tokens'].apply(lambda x: [token for token in x if token not in context_independent_stopwords])

filtered_comments_df.head(100)

Unnamed: 0,channelId,comment,tokens,cntInd-stopwords_removed_tokens
0,UCet0ZrYmw-V_hsGPb7KsiOQ,"just stop pretendiiiiing u feel the same, if n...","[just, stop, pretendiiiiing, u, feel, the, sam...","[stop, pretendiiiiing, feel, matters, blame, l..."
1,UCet0ZrYmw-V_hsGPb7KsiOQ,truly terrible.,"[truly, terrible]",[terrible]
2,UCet0ZrYmw-V_hsGPb7KsiOQ,in a dark candle lit hazy smoke filled basemen...,"[in, a, dark, candle, lit, hazy, smoke, filled...","[dark, candle, lit, hazy, smoke, filled, basem..."
3,UCsT0YIqwnpJCM-mx7-gSA4Q,this is so great! thank you!,"[this, is, so, great, thank, you]",[great]
4,UCsT0YIqwnpJCM-mx7-gSA4Q,jerry seinfeld looks different,"[jerry, seinfeld, looks, different]","[jerry, seinfeld]"
...,...,...,...,...
95,UCMSYZVlQmyG8_2MkIKzg0kw,i did the samething as her and imo ucla is eas...,"[i, did, the, samething, as, her, and, imo, uc...","[samething, imo, ucla, easily, times, harder, ..."
96,UCMSYZVlQmyG8_2MkIKzg0kw,and yet shes one of the densest people on the ...,"[and, yet, shes, one, of, the, densest, people...","[shes, densest, people, internet]"
97,UCNqFDjYTexJDET3rPDrmJKg,he is a natural man he has so many good songs,"[he, is, a, natural, man, he, has, so, many, g...","[natural, man, good, songs]"
98,UCq-Fj5jknLsUf-MWSy4_brA,ye song toh movie mein hona chahiye tha hop...,"[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movie, mein, hona, chahiye, th..."


In [28]:
# Step 4: Stem the words using Porter Stemmer
porter_stemmer = PorterStemmer()
filtered_comments_df['cntInd-SW_removed_stemmed_tokens'] = filtered_comments_df['cntInd-stopwords_removed_tokens'].apply(lambda x: [porter_stemmer.stem(token) for token in x])

filtered_comments_df.head(100)

Unnamed: 0,channelId,comment,tokens,cntInd-stopwords_removed_tokens,cntInd-SW_removed_stemmed_tokens
0,UCet0ZrYmw-V_hsGPb7KsiOQ,"just stop pretendiiiiing u feel the same, if n...","[just, stop, pretendiiiiing, u, feel, the, sam...","[stop, pretendiiiiing, feel, matters, blame, l...","[stop, pretendiiii, feel, matter, blame, love,..."
1,UCet0ZrYmw-V_hsGPb7KsiOQ,truly terrible.,"[truly, terrible]",[terrible],[terribl]
2,UCet0ZrYmw-V_hsGPb7KsiOQ,in a dark candle lit hazy smoke filled basemen...,"[in, a, dark, candle, lit, hazy, smoke, filled...","[dark, candle, lit, hazy, smoke, filled, basem...","[dark, candl, lit, hazi, smoke, fill, basement..."
3,UCsT0YIqwnpJCM-mx7-gSA4Q,this is so great! thank you!,"[this, is, so, great, thank, you]",[great],[great]
4,UCsT0YIqwnpJCM-mx7-gSA4Q,jerry seinfeld looks different,"[jerry, seinfeld, looks, different]","[jerry, seinfeld]","[jerri, seinfeld]"
...,...,...,...,...,...
95,UCMSYZVlQmyG8_2MkIKzg0kw,i did the samething as her and imo ucla is eas...,"[i, did, the, samething, as, her, and, imo, uc...","[samething, imo, ucla, easily, times, harder, ...","[sameth, imo, ucla, easili, time, harder, cc, ..."
96,UCMSYZVlQmyG8_2MkIKzg0kw,and yet shes one of the densest people on the ...,"[and, yet, shes, one, of, the, densest, people...","[shes, densest, people, internet]","[she, densest, peopl, internet]"
97,UCNqFDjYTexJDET3rPDrmJKg,he is a natural man he has so many good songs,"[he, is, a, natural, man, he, has, so, many, g...","[natural, man, good, songs]","[natur, man, good, song]"
98,UCq-Fj5jknLsUf-MWSy4_brA,ye song toh movie mein hona chahiye tha hop...,"[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movi, mein, hona, chahiy, tha,..."


In [29]:
print(TTF('cntInd-SW_removed_stemmed_tokens'))

Counter({'love': 915, 'make': 875, 'good': 851, 'time': 829, 'don': 820, 'video': 800, 'peopl': 749, 'year': 687, 'watch': 680, 'great': 646, 'thing': 641, 'day': 603, 'work': 585, 've': 548, 'back': 546, 'made': 522, 'feel': 518, 'life': 513, 'world': 505, 'start': 501, 'guy': 491, 'man': 478, 'put': 472, 'live': 472, 'lol': 468, 'lot': 465, 'give': 463, 'end': 462, 'call': 452, 'bro': 450, 'amaz': 434, 'person': 433, 'real': 432, 'didn': 424, 'show': 420, 'stop': 411, 'thought': 409, 'im': 408, 'bad': 408, 'nice': 402, 'find': 394, 'long': 393, 'talk': 391, 'god': 385, 'understand': 374, 'part': 373, 'hope': 367, 'learn': 363, 'beauti': 363, 'll': 361, 'sound': 357, 'big': 350, 'doesn': 350, 'hard': 339, 'chang': 335, 'dont': 329, 'song': 329, 'play': 327, 'wait': 325, 'money': 323, 'friend': 320, 'point': 314, 'wow': 313, 'comment': 308, 'wrong': 307, 'place': 306, 'happen': 304, 'fact': 303, 'listen': 300, 'music': 297, 'read': 296, 'true': 296, 'interest': 294, 'youtub': 293, 'rea

In [30]:
print(total_channels)

1315


In [31]:
# Step 5: Remove Context-dependent stopwords
ttf_context_dep_stopwords = TTF('cntInd-SW_removed_stemmed_tokens')
context_dependent_stopwords = set(word for word, freq in ttf_context_dep_stopwords.items() if freq / total_channels > 0.99)

print(context_dependent_stopwords) # none

set()
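
The empty set follows from the counts printed above: the most widespread stem, 'love', appears in 915 of the 1315 channels, well short of 99%. The arithmetic below also works out what the two thresholds mean in whole channels for this corpus; the variable names are illustrative only.

```python
total_channels = 1315
top_token_channels = 915  # 'love', the most widespread stem in the printed counts

top_ratio = top_token_channels / total_channels
print(f"{top_ratio:.3f}")

# Smallest channel count that would exceed the 99% threshold
min_stopword_channels = next(n for n in range(total_channels + 1) if n / total_channels > 0.99)
# Largest channel count that still falls under the 1% rare-token threshold
max_rare_channels = max(n for n in range(total_channels + 1) if n / total_channels < 0.01)
print(min_stopword_channels, max_rare_channels)
```

So a stem would need to appear in at least 1302 channels to count as a context-dependent stopword, and any stem appearing in 13 or fewer channels is treated as rare.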


In [32]:
filtered_comments_df['all_stopwords_removed_stemmed_tokens'] = filtered_comments_df['cntInd-SW_removed_stemmed_tokens'].apply(lambda x: [token for token in x if token not in context_dependent_stopwords])

filtered_comments_df.head(100)

Unnamed: 0,channelId,comment,tokens,cntInd-stopwords_removed_tokens,cntInd-SW_removed_stemmed_tokens,all_stopwords_removed_stemmed_tokens
0,UCet0ZrYmw-V_hsGPb7KsiOQ,"just stop pretendiiiiing u feel the same, if n...","[just, stop, pretendiiiiing, u, feel, the, sam...","[stop, pretendiiiiing, feel, matters, blame, l...","[stop, pretendiiii, feel, matter, blame, love,...","[stop, pretendiiii, feel, matter, blame, love,..."
1,UCet0ZrYmw-V_hsGPb7KsiOQ,truly terrible.,"[truly, terrible]",[terrible],[terribl],[terribl]
2,UCet0ZrYmw-V_hsGPb7KsiOQ,in a dark candle lit hazy smoke filled basemen...,"[in, a, dark, candle, lit, hazy, smoke, filled...","[dark, candle, lit, hazy, smoke, filled, basem...","[dark, candl, lit, hazi, smoke, fill, basement...","[dark, candl, lit, hazi, smoke, fill, basement..."
3,UCsT0YIqwnpJCM-mx7-gSA4Q,this is so great! thank you!,"[this, is, so, great, thank, you]",[great],[great],[great]
4,UCsT0YIqwnpJCM-mx7-gSA4Q,jerry seinfeld looks different,"[jerry, seinfeld, looks, different]","[jerry, seinfeld]","[jerri, seinfeld]","[jerri, seinfeld]"
...,...,...,...,...,...,...
95,UCMSYZVlQmyG8_2MkIKzg0kw,i did the samething as her and imo ucla is eas...,"[i, did, the, samething, as, her, and, imo, uc...","[samething, imo, ucla, easily, times, harder, ...","[sameth, imo, ucla, easili, time, harder, cc, ...","[sameth, imo, ucla, easili, time, harder, cc, ..."
96,UCMSYZVlQmyG8_2MkIKzg0kw,and yet shes one of the densest people on the ...,"[and, yet, shes, one, of, the, densest, people...","[shes, densest, people, internet]","[she, densest, peopl, internet]","[she, densest, peopl, internet]"
97,UCNqFDjYTexJDET3rPDrmJKg,he is a natural man he has so many good songs,"[he, is, a, natural, man, he, has, so, many, g...","[natural, man, good, songs]","[natur, man, good, song]","[natur, man, good, song]"
98,UCq-Fj5jknLsUf-MWSy4_brA,ye song toh movie mein hona chahiye tha hop...,"[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movi, mein, hona, chahiy, tha,...","[ye, song, toh, movi, mein, hona, chahiy, tha,..."


In [33]:
print(TTF('all_stopwords_removed_stemmed_tokens'))

Counter({'love': 915, 'make': 875, 'good': 851, 'time': 829, 'don': 820, 'video': 800, 'peopl': 749, 'year': 687, 'watch': 680, 'great': 646, 'thing': 641, 'day': 603, 'work': 585, 've': 548, 'back': 546, 'made': 522, 'feel': 518, 'life': 513, 'world': 505, 'start': 501, 'guy': 491, 'man': 478, 'put': 472, 'live': 472, 'lol': 468, 'lot': 465, 'give': 463, 'end': 462, 'call': 452, 'bro': 450, 'amaz': 434, 'person': 433, 'real': 432, 'didn': 424, 'show': 420, 'stop': 411, 'thought': 409, 'im': 408, 'bad': 408, 'nice': 402, 'find': 394, 'long': 393, 'talk': 391, 'god': 385, 'understand': 374, 'part': 373, 'hope': 367, 'learn': 363, 'beauti': 363, 'll': 361, 'sound': 357, 'big': 350, 'doesn': 350, 'hard': 339, 'chang': 335, 'dont': 329, 'song': 329, 'play': 327, 'wait': 325, 'money': 323, 'friend': 320, 'point': 314, 'wow': 313, 'comment': 308, 'wrong': 307, 'place': 306, 'happen': 304, 'fact': 303, 'listen': 300, 'music': 297, 'read': 296, 'true': 296, 'interest': 294, 'youtub': 293, 'rea

In [34]:
# Step 6: Remove rare tokens from the vocab

ttf_rare_tokens = TTF('all_stopwords_removed_stemmed_tokens')
rare_tokens = set(token for token, freq in ttf_rare_tokens.items() if freq / total_channels < 0.01)

print(rare_tokens)

{'heric', 'cod', 'subnet', 'jpmwoaejz', 'alexandria', 'kindest', 'snoorkel', 'alfi', 'patato', 'casket', 'ahaha', 'reynisfjara', 'bron', 'homo', 'darshankuttana', 'autopsi', 'heccccck', 'wheelma', 'tfx', 'eunwoo', 'ascent', 'hutchorson', 'melbournian', 'trenberth', 'woki', 'havenot', 'andrad', 'magnitud', 'losi', 'panach', 'lare', 'beatric', 'lune', 'cheya', 'pokemon', 'badjust', 'sjvn', 'zakarya', 'whoah', 'riot', 'hairstyl', 'provabl', 'kruger', 'bharat', 'fiiiirst', 'sharopova', 'parma', 'fastai', 'mandan', 'sweetner', 'agahhaha', 'yepp', 'sociolog', 'pinnacl', 'mediaev', 'cornflow', 'minigun', 'upchei', 'xbook', 'median', 'afton', 'icar', 'majapahit', 'presser', 'suss', 'cureent', 'hello', 'monaco', 'harrison', 'codein', 'sser', 'stainless', 'pernici', 'parametr', 'agri', 'sidebar', 'hostag', 'sonot', 'camer', 'wolph', 'fentyn', 'servo', 'comm', 'reunit', 'mercuri', 'distro', 'peppermint', 'ballet', 'synchronis', 'confer', 'mcr', 'xiu', 'obviat', 'sho', 'shoum', 'sirtop', 'purrpray

In [35]:
filtered_comments_df['SW-rem_stemmed_RT_removed_tokens'] = filtered_comments_df['all_stopwords_removed_stemmed_tokens'].apply(lambda x: [token for token in x if token not in rare_tokens])

filtered_comments_df.head(100)

Unnamed: 0,channelId,comment,tokens,cntInd-stopwords_removed_tokens,cntInd-SW_removed_stemmed_tokens,all_stopwords_removed_stemmed_tokens,SW-rem_stemmed_RT_removed_tokens
0,UCet0ZrYmw-V_hsGPb7KsiOQ,"just stop pretendiiiiing u feel the same, if n...","[just, stop, pretendiiiiing, u, feel, the, sam...","[stop, pretendiiiiing, feel, matters, blame, l...","[stop, pretendiiii, feel, matter, blame, love,...","[stop, pretendiiii, feel, matter, blame, love,...","[stop, feel, matter, blame, love, shit, rock]"
1,UCet0ZrYmw-V_hsGPb7KsiOQ,truly terrible.,"[truly, terrible]",[terrible],[terribl],[terribl],[terribl]
2,UCet0ZrYmw-V_hsGPb7KsiOQ,in a dark candle lit hazy smoke filled basemen...,"[in, a, dark, candle, lit, hazy, smoke, filled...","[dark, candle, lit, hazy, smoke, filled, basem...","[dark, candl, lit, hazi, smoke, fill, basement...","[dark, candl, lit, hazi, smoke, fill, basement...","[dark, lit, smoke, fill, lay, floor, stare, sm..."
3,UCsT0YIqwnpJCM-mx7-gSA4Q,this is so great! thank you!,"[this, is, so, great, thank, you]",[great],[great],[great],[great]
4,UCsT0YIqwnpJCM-mx7-gSA4Q,jerry seinfeld looks different,"[jerry, seinfeld, looks, different]","[jerry, seinfeld]","[jerri, seinfeld]","[jerri, seinfeld]",[]
...,...,...,...,...,...,...,...
95,UCMSYZVlQmyG8_2MkIKzg0kw,i did the samething as her and imo ucla is eas...,"[i, did, the, samething, as, her, and, imo, uc...","[samething, imo, ucla, easily, times, harder, ...","[sameth, imo, ucla, easili, time, harder, cc, ...","[sameth, imo, ucla, easili, time, harder, cc, ...","[imo, easili, time, harder, liber, art, major,..."
96,UCMSYZVlQmyG8_2MkIKzg0kw,and yet shes one of the densest people on the ...,"[and, yet, shes, one, of, the, densest, people...","[shes, densest, people, internet]","[she, densest, peopl, internet]","[she, densest, peopl, internet]","[she, peopl, internet]"
97,UCNqFDjYTexJDET3rPDrmJKg,he is a natural man he has so many good songs,"[he, is, a, natural, man, he, has, so, many, g...","[natural, man, good, songs]","[natur, man, good, song]","[natur, man, good, song]","[natur, man, good, song]"
98,UCq-Fj5jknLsUf-MWSy4_brA,ye song toh movie mein hona chahiye tha hop...,"[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movi, mein, hona, chahiy, tha,...","[ye, song, toh, movi, mein, hona, chahiy, tha,...","[ye, song, movi, tha, hope, releas]"


In [36]:
# Step 7: Remove tokens with a length less than 3
filtered_comments_df['final_unigram_tokens'] = filtered_comments_df['SW-rem_stemmed_RT_removed_tokens'].apply(lambda x: [token for token in x if len(token) >= 3])

filtered_comments_df.head(100)

Unnamed: 0,channelId,comment,tokens,cntInd-stopwords_removed_tokens,cntInd-SW_removed_stemmed_tokens,all_stopwords_removed_stemmed_tokens,SW-rem_stemmed_RT_removed_tokens,final_unigram_tokens
0,UCet0ZrYmw-V_hsGPb7KsiOQ,"just stop pretendiiiiing u feel the same, if n...","[just, stop, pretendiiiiing, u, feel, the, sam...","[stop, pretendiiiiing, feel, matters, blame, l...","[stop, pretendiiii, feel, matter, blame, love,...","[stop, pretendiiii, feel, matter, blame, love,...","[stop, feel, matter, blame, love, shit, rock]","[stop, feel, matter, blame, love, shit, rock]"
1,UCet0ZrYmw-V_hsGPb7KsiOQ,truly terrible.,"[truly, terrible]",[terrible],[terribl],[terribl],[terribl],[terribl]
2,UCet0ZrYmw-V_hsGPb7KsiOQ,in a dark candle lit hazy smoke filled basemen...,"[in, a, dark, candle, lit, hazy, smoke, filled...","[dark, candle, lit, hazy, smoke, filled, basem...","[dark, candl, lit, hazi, smoke, fill, basement...","[dark, candl, lit, hazi, smoke, fill, basement...","[dark, lit, smoke, fill, lay, floor, stare, sm...","[dark, lit, smoke, fill, lay, floor, stare, sm..."
3,UCsT0YIqwnpJCM-mx7-gSA4Q,this is so great! thank you!,"[this, is, so, great, thank, you]",[great],[great],[great],[great],[great]
4,UCsT0YIqwnpJCM-mx7-gSA4Q,jerry seinfeld looks different,"[jerry, seinfeld, looks, different]","[jerry, seinfeld]","[jerri, seinfeld]","[jerri, seinfeld]",[],[]
...,...,...,...,...,...,...,...,...
95,UCMSYZVlQmyG8_2MkIKzg0kw,i did the samething as her and imo ucla is eas...,"[i, did, the, samething, as, her, and, imo, uc...","[samething, imo, ucla, easily, times, harder, ...","[sameth, imo, ucla, easili, time, harder, cc, ...","[sameth, imo, ucla, easili, time, harder, cc, ...","[imo, easili, time, harder, liber, art, major,...","[imo, easili, time, harder, liber, art, major,..."
96,UCMSYZVlQmyG8_2MkIKzg0kw,and yet shes one of the densest people on the ...,"[and, yet, shes, one, of, the, densest, people...","[shes, densest, people, internet]","[she, densest, peopl, internet]","[she, densest, peopl, internet]","[she, peopl, internet]","[she, peopl, internet]"
97,UCNqFDjYTexJDET3rPDrmJKg,he is a natural man he has so many good songs,"[he, is, a, natural, man, he, has, so, many, g...","[natural, man, good, songs]","[natur, man, good, song]","[natur, man, good, song]","[natur, man, good, song]","[natur, man, good, song]"
98,UCq-Fj5jknLsUf-MWSy4_brA,ye song toh movie mein hona chahiye tha hop...,"[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movi, mein, hona, chahiy, tha,...","[ye, song, toh, movi, mein, hona, chahiy, tha,...","[ye, song, movi, tha, hope, releas]","[song, movi, tha, hope, releas]"


**✔ CHECKPOINT #3 (for checking the final unigram token lists)**

In [37]:
all_tokenfile_df = 'check3.txt'

filtered_comments_df.to_csv(all_tokenfile_df, sep='\t', index=False)

### Generate top 200 meaningful Bigram Tokens

The meaningfulness of a bigram is measured by PMI (Pointwise Mutual Information). PMI compares how often two words are actually observed next to each other with how often they would be expected to appear together if they were just two randomly picked, independent words; the score is the logarithm of this ratio.

The bigrams are generated from the lists of word tokens obtained after applying word tokenization to the English comments.

🌟 The bigrams are ranked by PMI in descending order and the top 200 are picked, after keeping only those that occur at least 10 times as adjacent pairs in the token lists. This count is not their real token frequency; the actual token frequencies are calculated later with respect to the comments.

*(I have done this so that only bigrams common enough to deserve a place in the vocabulary are picked, and to speed up the collocation check.)*
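
To make the PMI idea concrete, here is a minimal sketch that computes it by hand on a toy token list. It assumes a base-2 logarithm and estimates every probability as a raw count divided by the number of tokens; the helper name `pmi` and the toy sentence are made up for illustration and are not part of this notebook. The scores returned by NLTK in the next cell may be estimated slightly differently.

```python
import math
from collections import Counter

def pmi(bigram_count, count_x, count_y, total_tokens):
    """Pointwise mutual information (base 2) from raw counts."""
    # log2( P(x, y) / (P(x) * P(y)) ), each probability estimated as count / total_tokens
    return math.log2(bigram_count * total_tokens / (count_x * count_y))

# Toy data (illustrative only)
tokens = ["sri", "lanka", "is", "nice", "and", "sri", "lanka",
          "is", "far", "and", "it", "is", "big"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

pmi_sri_lanka = pmi(bigram_counts[("sri", "lanka")],
                    unigram_counts["sri"], unigram_counts["lanka"], n)
pmi_is_nice = pmi(bigram_counts[("is", "nice")],
                  unigram_counts["is"], unigram_counts["nice"], n)

print(pmi_sri_lanka, pmi_is_nice)
```

In this toy list "sri" is always followed by "lanka", while "is" is followed by three different words, so the first pair gets the higher score. A pair that co-occurs exactly as often as chance predicts scores 0.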

In [38]:
# Calculate PMI
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents(filtered_comments_df['tokens'])
finder.apply_freq_filter(10) # keep only bigrams seen at least 10 times in the token lists
pmi_scores = finder.score_ngrams(bigram_measures.pmi)

# Sort bigrams based on PMI scores
sorted_bigrams = sorted(pmi_scores, key=lambda x: x[1], reverse=True)

print(len(sorted_bigrams))

11641


#### Keeping only bigrams with a PMI greater than 5

*(I am doing this to reduce the number of candidates for the next step, which keeps only those meaningful bigrams that actually occur as collocations in the comment text.)*

In [39]:
filt_sorted_bigrams = list(filter(lambda x: x[1] > 5, sorted_bigrams))

print(len(filt_sorted_bigrams))

1466


In [40]:
# Searching for rows of comments that have the Bigram collocations

# Create a regex pattern for each bigram
patterns = [r"\b{}\s+{}\b".format(re.escape(bigram[0][0]), re.escape(bigram[0][1])) for bigram in filt_sorted_bigrams]

# Combine all patterns into a single pattern
combined_pattern = "|".join(patterns)

# Search for the combined pattern in the 'comment' column
matches = filtered_comments_df['comment'].str.contains(combined_pattern, regex=True)

# Keep only the rows that matched at least one pattern
matched_comments_df = filtered_comments_df[matches]

print(len(matched_comments_df))
print(patterns)

23503
['\\bsri\\s+lanka\\b', '\\btracer\\s+width\\b', '\\bsci\\s+fi\\b', '\\biiec\\s+pydroid\\b', '\\bru\\s+iiec\\b', '\\bblah\\s+blah\\b', '\\bhip\\s+hop\\b', '\\bwhatcha\\s+gon\\b', '\\bhideturtle\\s+bgcolor\\b', '\\bthou\\s+shalt\\b', '\\bcotton\\s+eyed\\b', '\\bmasonic\\s+mafia\\b', '\\bmaple\\s+syrup\\b', '\\bmoluwu\\s+politically\\b', '\\bnaive\\s+bayes\\b', '\\bwaka\\s+waka\\b', '\\bfa\\s+fa\\b', '\\bel\\s+nido\\b', '\\bsore\\s+throat\\b', '\\bbeep\\s+beep\\b', '\\bmm\\s+mm\\b', '\\bcolorsys\\s+hsv\\b', '\\bpydroid\\s+files\\b', '\\bneural\\s+networks\\b', '\\bfossil\\s+fuels\\b', '\\btom\\s+riddle\\b', '\\btaylor\\s+swift\\b', '\\bmafia\\s+mission\\b', '\\bplot\\s+twist\\b', '\\bperiodic\\s+table\\b', '\\bsolar\\s+panels\\b', '\\beuphrates\\s+river\\b', '\\bmichael\\s+jackson\\b', '\\balbert\\s+einstein\\b', '\\bclick\\s+bait\\b', '\\bsoviet\\s+union\\b', '\\bwhoa\\s+whoa\\b', '\\buser\\s+ru\\b', '\\bmetro\\s+tunnel\\b', '\\butm\\s+campaign\\b', '\\beyed\\s+joe\\b', '\\bstatic\

In [41]:
def check_collocation(text, x, y):
    pattern = r"\b{}\s+{}\b".format(re.escape(x), re.escape(y))
    return bool(re.search(pattern, text))
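
The behaviour of this pattern can be illustrated with a few made-up strings. The sketch below repeats the helper so that it runs on its own; the example sentences are assumptions chosen to show the effect of `\b` (whole words only) and `\s+` (one or more whitespace characters between the two words).

```python
import re

def check_collocation(text, x, y):
    # Same pattern as the cell above: x and y as whole words separated by whitespace
    pattern = r"\b{}\s+{}\b".format(re.escape(x), re.escape(y))
    return bool(re.search(pattern, text))

# Illustrative inputs (not taken from the dataset)
adjacent = check_collocation("i love taylor swift songs", "taylor", "swift")
extra_spaces = check_collocation("taylor   swift", "taylor", "swift")
inside_longer_word = check_collocation("taylor swiftly left", "taylor", "swift")
separated = check_collocation("taylor is swift", "taylor", "swift")

print(adjacent, extra_spaces, inside_longer_word, separated)
```

Only the first two strings count as collocations: the third fails because "swift" is part of a longer word, and the fourth because another word sits between the pair.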

#### Keeping only bigrams that actually occur as collocations in the comments

In [42]:
top_meaningful_bigrams = []

for meaningful_bigram in filt_sorted_bigrams:
    x, y = meaningful_bigram[0]
    # Keep the bigram if at least one matched comment contains "x y" as adjacent whole words
    if any(check_collocation(comment, x, y) for comment in matched_comments_df['comment']):
        top_meaningful_bigrams.append(meaningful_bigram)

print(top_meaningful_bigrams[:10])

[(('sri', 'lanka'), 15.886842351436671), (('sci', 'fi'), 15.793732947045191), (('blah', 'blah'), 15.341273289015604), (('hip', 'hop'), 15.306339498906215), (('whatcha', 'gon'), 15.211277301934606), (('thou', 'shalt'), 14.948242896100815), (('masonic', 'mafia'), 14.863353998514302), (('maple', 'syrup'), 14.85709500804262), (('moluwu', 'politically'), 14.827948662383102), (('naive', 'bayes'), 14.827948662383102)]


In [43]:
print(len(top_meaningful_bigrams))

1362


#### Selecting the top 200 bigram collocations

In [44]:
bigram_token_list = []

# Print top 200 meaningful bigrams
print("Top 200 Meaningful Bigrams:")
for bigram, pmi in sorted(top_meaningful_bigrams, key=lambda x: x[1], reverse=True)[:200]:
    print(bigram, pmi)
    bigram_token_list.append(bigram)

Top 200 Meaningful Bigrams:
('sri', 'lanka') 15.886842351436671
('sci', 'fi') 15.793732947045191
('blah', 'blah') 15.341273289015604
('hip', 'hop') 15.306339498906215
('whatcha', 'gon') 15.211277301934606
('thou', 'shalt') 14.948242896100815
('masonic', 'mafia') 14.863353998514302
('maple', 'syrup') 14.85709500804262
('moluwu', 'politically') 14.827948662383102
('naive', 'bayes') 14.827948662383102
('waka', 'waka') 14.734839257991622
('fa', 'fa') 14.716917349994358
('el', 'nido') 14.533205396821971
('sore', 'throat') 14.486911744548035
('beep', 'beep') 14.360296536940835
('mm', 'mm') 14.356327634737893
('neural', 'networks') 14.038845444881721
('fossil', 'fuels') 13.997873663825416
('tom', 'riddle') 13.95723167932807
('taylor', 'swift') 13.943529302509337
('mafia', 'mission') 13.927484335934015
('plot', 'twist') 13.918678438558674
('periodic', 'table') 13.866942793998966
('solar', 'panels') 13.647376416741283
('euphrates', 'river') 13.64737641674128
('michael', 'jackson') 13.6353035844

In [45]:
# Join each bigram with an underscore, the same form MWETokenizer produces with its default separator
bigram_meaningful_tokens = ['_'.join(bigram) for bigram in bigram_token_list]

In [46]:
print(bigram_meaningful_tokens)

['sri_lanka', 'sci_fi', 'blah_blah', 'hip_hop', 'whatcha_gon', 'thou_shalt', 'masonic_mafia', 'maple_syrup', 'moluwu_politically', 'naive_bayes', 'waka_waka', 'fa_fa', 'el_nido', 'sore_throat', 'beep_beep', 'mm_mm', 'neural_networks', 'fossil_fuels', 'tom_riddle', 'taylor_swift', 'mafia_mission', 'plot_twist', 'periodic_table', 'solar_panels', 'euphrates_river', 'michael_jackson', 'albert_einstein', 'click_bait', 'soviet_union', 'metro_tunnel', 'eyed_joe', 'static_void', 'carbon_dioxide', 'pop_mage', 'eh_eh', 'industrial_revolution', 'merry_christmas', 'string_args', 'la_la', 'begotten_son', 'yule_ball', 'cyber_truck', 'vast_majority', 'private_jets', 'olive_oil', 'elon_musk', 'ice_spice', 'bill_gates', 'professional_certificate', 'mac_os', 'fossil_fuel', 'stephen_king', 'import_colorsys', 'gen_z', 'draco_malfoy', 'ha_ha', 'zodiac_sign', 'short_sighted', 'global_warming', 'artificial_intelligence', 'donald_trump', 'corporate_greed', 'uh_huh', 'ice_cube', 'united_states', 'daily_basis',

In [47]:
def list_to_txt(lst, filename):
    with open(filename, 'w') as f:
        for item in lst:
            f.write(str(item) + '\n')

list_to_txt(bigram_meaningful_tokens, '125_bigrams.txt') # final list of top 200 meaningful bigrams

<div class="alert alert-block alert-warning">
    
### Vocabulary List - Combine generated unigram and bigram tokens <a class="anchor" name="write-vocab"></a>

In [48]:
# Extract unique unigram tokens from final_unigram_tokens
unique_unigram_tokens = set(token for tokens in filtered_comments_df['final_unigram_tokens'] for token in tokens)

# Add bigram_meaningful_tokens list to unique_unigram_tokens
unique_unigram_tokens.update(bigram_meaningful_tokens)

# Sort all tokens alphabetically
vocab = sorted(unique_unigram_tokens)

# Display the sorted tokens
print(type(vocab))
print(vocab)
print(len(vocab))


<class 'list'>
['abandon', 'abil', 'abroad', 'absolut', 'absorb', 'abstract', 'absurd', 'abt', 'abund', 'abus', 'academ', 'academi', 'acceler', 'accent', 'accept', 'access', 'accid', 'accident', 'accomplish', 'account', 'accur', 'accuraci', 'ach', 'achiev', 'acid', 'acknowledg', 'acquir', 'act', 'action', 'activ', 'activist', 'actor', 'actress', 'actual', 'adam', 'adapt', 'add', 'addict', 'addit', 'address', 'adhd', 'adjust', 'admin', 'administr', 'admir', 'admit', 'adopt', 'ador', 'adult', 'advanc', 'advantag', 'adventur', 'advertis', 'advic', 'advis', 'advoc', 'aesthet', 'affect', 'afford', 'afraid', 'africa', 'african', 'afternoon', 'age', 'agenda', 'agent', 'aggress', 'ago', 'agre', 'agricultur', 'ah_ah', 'ahead', 'ahh', 'ai_agents', 'aid', 'aim', 'ain', 'aint', 'air', 'airport', 'aka', 'alarm', 'albert_einstein', 'album', 'alcohol', 'alert', 'algorithm', 'ali', 'alien', 'align', 'aliv', 'allah', 'alli', 'allow', 'almighti', 'alot', 'alright', 'altern', 'amaz', 'amazingli', 'amazon

### **Outputting to vocab.txt (Output File #2)**

In [49]:
# Output sorted tokens to a .txt file, one "token:index" pair per line
vocab_main = {}
with open('125_vocab.txt', 'w') as file:
    for index, token in enumerate(vocab):
        vocab_main[token] = index
        file.write(f"{token}:{index}\n")


In [50]:
print(vocab_main)

{'abandon': 0, 'abil': 1, 'abroad': 2, 'absolut': 3, 'absorb': 4, 'abstract': 5, 'absurd': 6, 'abt': 7, 'abund': 8, 'abus': 9, 'academ': 10, 'academi': 11, 'acceler': 12, 'accent': 13, 'accept': 14, 'access': 15, 'accid': 16, 'accident': 17, 'accomplish': 18, 'account': 19, 'accur': 20, 'accuraci': 21, 'ach': 22, 'achiev': 23, 'acid': 24, 'acknowledg': 25, 'acquir': 26, 'act': 27, 'action': 28, 'activ': 29, 'activist': 30, 'actor': 31, 'actress': 32, 'actual': 33, 'adam': 34, 'adapt': 35, 'add': 36, 'addict': 37, 'addit': 38, 'address': 39, 'adhd': 40, 'adjust': 41, 'admin': 42, 'administr': 43, 'admir': 44, 'admit': 45, 'adopt': 46, 'ador': 47, 'adult': 48, 'advanc': 49, 'advantag': 50, 'adventur': 51, 'advertis': 52, 'advic': 53, 'advis': 54, 'advoc': 55, 'aesthet': 56, 'affect': 57, 'afford': 58, 'afraid': 59, 'africa': 60, 'african': 61, 'afternoon': 62, 'age': 63, 'agenda': 64, 'agent': 65, 'aggress': 66, 'ago': 67, 'agre': 68, 'agricultur': 69, 'ah_ah': 70, 'ahead': 71, 'ahh': 72

In [51]:
##### Step 4 COMPLETE!

<div class="alert alert-block alert-success">
    
## STEP 5 : Sparse Matrix and CountVec <a class="anchor" name="step5"></a>

### Re-tokenizing text based on bigrams

In [52]:
new_mwe = MWETokenizer(bigram_token_list)
filtered_comments_df['re-tokenized_tokens'] = filtered_comments_df['tokens'].apply(lambda x: new_mwe.tokenize(x))

filtered_comments_df.head(100)

Unnamed: 0,channelId,comment,tokens,cntInd-stopwords_removed_tokens,cntInd-SW_removed_stemmed_tokens,all_stopwords_removed_stemmed_tokens,SW-rem_stemmed_RT_removed_tokens,final_unigram_tokens,re-tokenized_tokens
0,UCet0ZrYmw-V_hsGPb7KsiOQ,"just stop pretendiiiiing u feel the same, if n...","[just, stop, pretendiiiiing, u, feel, the, sam...","[stop, pretendiiiiing, feel, matters, blame, l...","[stop, pretendiiii, feel, matter, blame, love,...","[stop, pretendiiii, feel, matter, blame, love,...","[stop, feel, matter, blame, love, shit, rock]","[stop, feel, matter, blame, love, shit, rock]","[just, stop, pretendiiiiing, u, feel, the, sam..."
1,UCet0ZrYmw-V_hsGPb7KsiOQ,truly terrible.,"[truly, terrible]",[terrible],[terribl],[terribl],[terribl],[terribl],"[truly, terrible]"
2,UCet0ZrYmw-V_hsGPb7KsiOQ,in a dark candle lit hazy smoke filled basemen...,"[in, a, dark, candle, lit, hazy, smoke, filled...","[dark, candle, lit, hazy, smoke, filled, basem...","[dark, candl, lit, hazi, smoke, fill, basement...","[dark, candl, lit, hazi, smoke, fill, basement...","[dark, lit, smoke, fill, lay, floor, stare, sm...","[dark, lit, smoke, fill, lay, floor, stare, sm...","[in, a, dark, candle, lit, hazy, smoke, filled..."
3,UCsT0YIqwnpJCM-mx7-gSA4Q,this is so great! thank you!,"[this, is, so, great, thank, you]",[great],[great],[great],[great],[great],"[this, is, so, great, thank, you]"
4,UCsT0YIqwnpJCM-mx7-gSA4Q,jerry seinfeld looks different,"[jerry, seinfeld, looks, different]","[jerry, seinfeld]","[jerri, seinfeld]","[jerri, seinfeld]",[],[],"[jerry, seinfeld, looks, different]"
...,...,...,...,...,...,...,...,...,...
95,UCMSYZVlQmyG8_2MkIKzg0kw,i did the samething as her and imo ucla is eas...,"[i, did, the, samething, as, her, and, imo, uc...","[samething, imo, ucla, easily, times, harder, ...","[sameth, imo, ucla, easili, time, harder, cc, ...","[sameth, imo, ucla, easili, time, harder, cc, ...","[imo, easili, time, harder, liber, art, major,...","[imo, easili, time, harder, liber, art, major,...","[i, did, the, samething, as, her, and, imo, uc..."
96,UCMSYZVlQmyG8_2MkIKzg0kw,and yet shes one of the densest people on the ...,"[and, yet, shes, one, of, the, densest, people...","[shes, densest, people, internet]","[she, densest, peopl, internet]","[she, densest, peopl, internet]","[she, peopl, internet]","[she, peopl, internet]","[and, yet, shes, one, of, the, densest, people..."
97,UCNqFDjYTexJDET3rPDrmJKg,he is a natural man he has so many good songs,"[he, is, a, natural, man, he, has, so, many, g...","[natural, man, good, songs]","[natur, man, good, song]","[natur, man, good, song]","[natur, man, good, song]","[natur, man, good, song]","[he, is, a, natural, man, he, has, so, many, g..."
98,UCq-Fj5jknLsUf-MWSy4_brA,ye song toh movie mein hona chahiye tha hop...,"[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movie, mein, hona, chahiye, th...","[ye, song, toh, movi, mein, hona, chahiy, tha,...","[ye, song, toh, movi, mein, hona, chahiy, tha,...","[ye, song, movi, tha, hope, releas]","[song, movi, tha, hope, releas]","[ye, song, toh, movie, mein, hona, chahiye, th..."
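
The effect of re-tokenizing with a bigram list can be sketched in plain Python. This is only an illustration of the merging behaviour for two-word expressions, not NLTK's implementation; the function name `merge_bigrams` and the sample tokens are made up for this example.

```python
def merge_bigrams(tokens, bigrams, separator="_"):
    """Join adjacent token pairs found in `bigrams` into one token, scanning left to right."""
    bigram_set = set(bigrams)
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigram_set:
            merged.append(separator.join(pair))
            i += 2  # both words are consumed by the merged token
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Illustrative input (not taken from the dataset)
example = merge_bigrams(
    ["i", "love", "taylor", "swift", "and", "hip", "hop"],
    [("taylor", "swift"), ("hip", "hop")],
)
print(example)
```

Tokens that are not part of a listed bigram pass through unchanged, which is why most rows in the table above look identical in the `tokens` and `re-tokenized_tokens` columns.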


In [53]:
# Search for rows where 'taylor_swift' is present in the 're-tokenized_tokens' column
matching_rows = filtered_comments_df[filtered_comments_df['re-tokenized_tokens'].apply(lambda tokens: 'taylor_swift' in tokens)]

# Extract the channelId from the matching rows
matching_channelIds = matching_rows['channelId'].unique()

# Print the channelId(s) that contain 'taylor_swift'
print("ChannelIds containing 'taylor_swift':", matching_channelIds)

ChannelIds containing 'taylor_swift': ['UCANLZYMidaCbLQFWXBC95Jg' 'UCzqKhRhaE0fNl_Lg9jTaXbQ'
 'UCtWWry0cgjWi9C0Ye9sh6og' 'UCWI1AJmkMm_sbTX_Ps8gQAw'
 'UCWBWgCD4oAqT3hUeq40SCUw' 'UC-AlofdKECUdhXrbJQZ6iEg'
 'UCja8sZ2T4ylIqjggA1Zuukg' 'UC02rAIf5RYrww6AWhoI5S0w'
 'UCCzmXhARScj7wtJ0f2co64A' 'UCtBAlQkiTnjqzkSmO62iDvA'
 'UCk5TtU79zpTNmQrSUUIm25A' 'UCIHdDJ0tjn_3j-FS7s_X1kQ'
 'UCX6b17PVsYBQ0ip5gyeme-Q' 'UCFVM_zMWzOCMaQii3fCnCKA'
 'UCO0akufu9MOzyz3nvGIXAAw' 'UCcNPbOeFo-qM0wpis8Lwdig']


### Combining all comments under the same channel

In [54]:
# Group the DataFrame by 'channelId' and concatenate the token lists
combined_tokens_df = filtered_comments_df.groupby('channelId')['re-tokenized_tokens'].apply(lambda x: sum(x, [])).reset_index()

# Rename the column
combined_tokens_df.columns = ['channelId', 'combined_tokens']

# Display the resulting DataFrame
combined_tokens_df


Unnamed: 0,channelId,combined_tokens
0,UC-3sBKh8YYbG2KyVHnSyA1A,"[my, favourite, movie, is, beauty, and, the, b..."
1,UC-9b7aDP6ZN0coj9-xFnrtw,"[has, anyone, noticed, the, deceptive, attempt..."
2,UC-AlofdKECUdhXrbJQZ6iEg,"[where, is, everyone, listening, from, i, love..."
3,UC-B0ARaD-Y0p95aAqbbawQQ,"[i, still, use, that, song, for, the, helicopt..."
4,UC-CSyyi47VX1lD9zyeABW3w,"[i, want, to, become, a, footballer, i, tried,..."
...,...,...
1310,UCznj32AM2r98hZfTxrRo9bQ,"[i, really, love, this, lesson, from, someone,..."
1311,UCzqKhRhaE0fNl_Lg9jTaXbQ,"[best, upload, yet, my, son, i, were, there, n..."
1312,UCzsEAkRcIsKpd1WaJnRR97A,"[thx, ur, the, best, ilove, you, thxxxxx, so, ..."
1313,UCzuqE7-t13O4NIDYJfakrhw,"[our, constitutional, right, to, free, speech,..."
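
The grouping step can also be sketched without pandas. The version below is a minimal illustration that assumes the rows are available as `(channel_id, token_list)` pairs; the function name and the sample rows are hypothetical. It extends one list per channel in place, whereas `sum(x, [])` builds a new list at every addition and can become slow for channels with many comments.

```python
from collections import defaultdict

def combine_tokens_by_channel(rows):
    """rows: iterable of (channel_id, token_list); returns {channel_id: all tokens in order}."""
    combined = defaultdict(list)
    for channel_id, tokens in rows:
        combined[channel_id].extend(tokens)  # in-place extend, no intermediate lists
    return dict(combined)

# Illustrative rows (not taken from the dataset)
rows = [
    ("channel_a", ["truly", "terrible"]),
    ("channel_b", ["so", "great"]),
    ("channel_a", ["stop", "pretending"]),
]
combined = combine_tokens_by_channel(rows)
print(combined)
```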


In [55]:
# Join each channel's token list into one whitespace-separated string
combined_tokens_df['token_string'] = combined_tokens_df['combined_tokens'].apply(' '.join)

print(combined_tokens_df.head(100))

                   channelId  \
0   UC-3sBKh8YYbG2KyVHnSyA1A   
1   UC-9b7aDP6ZN0coj9-xFnrtw   
2   UC-AlofdKECUdhXrbJQZ6iEg   
3   UC-B0ARaD-Y0p95aAqbbawQQ   
4   UC-CSyyi47VX1lD9zyeABW3w   
..                       ...   
95  UC2xj044cgBpoXOnfQZhO5Yg   
96  UC2yq41wib1MOacBef2crlbw   
97  UC2zuipdDU9Dp5FFPtZ-y3rQ   
98  UC362wXiUbL4_okVgEPdHqmQ   
99  UC3DFdy_qc-cqgKCyQTHLGzA   

                                      combined_tokens  \
0   [my, favourite, movie, is, beauty, and, the, b...   
1   [has, anyone, noticed, the, deceptive, attempt...   
2   [where, is, everyone, listening, from, i, love...   
3   [i, still, use, that, song, for, the, helicopt...   
4   [i, want, to, become, a, footballer, i, tried,...   
..                                                ...   
95  [put, the, update, you, copied, form, banana, ...   
96  [i, have, a, hand, pump, sealer, where, you, d...   
97  [windows, shutting, down, that, st, and, rd, a...   
98  [th, th, ka, ncert, books, please, reply,

### Generating Sparse Representation (CountVec)

In [56]:
# Whitespace tokenizer (defined here, but not passed to CountVectorizer below)
def custom_tokenizer(doc):
    return doc.split()  # Split the document into tokens based on whitespace

# Initialize CountVectorizer with the fixed vocabulary; the default word analyzer does the tokenizing
vectorizer = CountVectorizer(vocabulary=vocab_main, analyzer = "word")

# Count the vocabulary tokens in each channel's token string
sparse_representation = vectorizer.fit_transform(combined_tokens_df['token_string'])

# Convert the sparse representation to a DataFrame
sparse_df = pd.DataFrame.sparse.from_spmatrix(sparse_representation)

# Set the column names to the vocabulary tokens
sparse_df.columns = vocab

# Concatenate the sparse representation with the channelId column
sparse_df = pd.concat([combined_tokens_df[['channelId']], sparse_df], axis=1)

# Display the resulting DataFrame
print(sparse_df)


                     channelId  abandon  abil  abroad  absolut  absorb  \
0     UC-3sBKh8YYbG2KyVHnSyA1A        0     0       0        0       0   
1     UC-9b7aDP6ZN0coj9-xFnrtw        0     0       0        0       0   
2     UC-AlofdKECUdhXrbJQZ6iEg        0     0       0        0       0   
3     UC-B0ARaD-Y0p95aAqbbawQQ        0     0       0        0       0   
4     UC-CSyyi47VX1lD9zyeABW3w        0     0       0        0       0   
...                        ...      ...   ...     ...      ...     ...   
1310  UCznj32AM2r98hZfTxrRo9bQ        0     0       0        0       0   
1311  UCzqKhRhaE0fNl_Lg9jTaXbQ        0     0       0        0       0   
1312  UCzsEAkRcIsKpd1WaJnRR97A        0     0       0        0       0   
1313  UCzuqE7-t13O4NIDYJfakrhw        0     0       0        0       0   
1314  UCzzFtVXMzxD08Zej-qokESw        0     0       0        0       0   

      abstract  absurd  abt  abund  ...  youth  youtu  youtub  yr_old  \
0            0       0    0      0  ..

In [57]:
# checking a sparse representation example (second channel)
vocab2 = vectorizer.get_feature_names_out()
for word, count in zip(vocab2, sparse_representation.toarray()[1]):
    if count > 0:
        print (word, ":", count)

action : 2
add : 2
admit : 1
affect : 2
afraid : 1
age : 1
agenda : 4
ago : 3
air : 3
alarm : 1
alert : 1
ancient : 1
answer : 1
argument : 1
back : 6
behavior : 2
benefit : 1
bias : 1
billion : 1
blame : 1
breath : 1
btw : 1
build : 1
call : 1
canada : 1
carbon : 2
carbon_dioxide : 1
care : 1
catch : 1
child : 1
china : 2
claim : 1
clean : 1
cleaner : 1
climate_change : 5
coal : 1
contain : 1
control : 2
correct : 1
corrupt : 1
crucial : 1
cut : 2
data : 3
date : 1
day : 1
didn : 1
die : 1
digit : 1
dirt : 1
doesnt : 1
dollar : 2
don : 1
due : 2
e_g : 2
earth : 8
eat : 1
effect : 1
effort : 1
elect : 1
element : 2
environment : 3
even : 7
expect : 1
extra : 1
fall : 1
find : 3
food : 1
fossil_fuel : 1
found : 6
free : 1
fresh : 1
garden : 1
global : 6
global_warming : 2
good : 2
graph : 2
great : 3
green : 1
group : 1
half : 1
haven : 1
heard : 1
heat : 1
help : 2
hey : 1
holiday : 1
home : 1
huge : 1
human : 3
human_beings : 1
ice : 1
ice_age : 3
idea : 1
ill : 1
impact : 4
incorrect
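
What a count vector over a fixed vocabulary contains can be sketched with the standard library alone. This is a simplified illustration, not what scikit-learn does internally; the function name `count_with_vocab` and the tiny vocabulary are made up for the example. In this sketch a token is counted only if it matches a vocabulary entry exactly, so the unstemmed "terrible" does not count towards the stemmed entry "terribl".

```python
from collections import Counter

def count_with_vocab(tokens, vocab_index):
    """Return {token_index: frequency} for the tokens that appear in vocab_index."""
    counts = Counter(token for token in tokens if token in vocab_index)
    return {vocab_index[token]: freq for token, freq in counts.items()}

# Illustrative vocabulary and tokens (not taken from the dataset)
vocab_index = {"good": 0, "song": 1, "taylor_swift": 2, "terribl": 3}
tokens = ["good", "song", "good", "taylor_swift", "terrible", "unknown"]
counts = count_with_vocab(tokens, vocab_index)
print(counts)
```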

### **Outputting to countvec.txt (Output File #3)**


In [58]:
# Define the filename for the output text file
output_file = '125_countvec.txt'

# Open the file in write mode
with open(output_file, 'w') as file:
    # Iterate over the rows of the sparse DataFrame
    for index, row in sparse_df.iterrows():
        # Write the channel ID to the file
        file.write(f"{row['channelId']}")

        # Iterate over the columns (tokens) and their frequencies
        for column, frequency in row.items():
            # Skip the ID column and write only tokens that occur at least once
            if column != 'channelId' and frequency > 0:
                file.write(f",{vocab_main[column]}:{frequency}")

        # End the line for the current channel ID
        file.write("\n")

print(f"Sparse representation has been written to {output_file}.")


Sparse representation has been written to 125_countvec.txt.
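
The line format of the count-vector file can be checked with a small round trip. The two helpers below are a sketch written for this illustration only (their names and the channel ID are made up); they assume that a channel ID never contains a comma and that entries are written as `token_index:frequency`.

```python
def format_countvec_line(channel_id, index_counts):
    """Build 'channelId,idx:freq,idx:freq,...' with entries ordered by token index."""
    parts = [channel_id]
    parts.extend(
        f"{index}:{freq}" for index, freq in sorted(index_counts.items()) if freq > 0
    )
    return ",".join(parts)

def parse_countvec_line(line):
    """Inverse of format_countvec_line: returns (channel_id, {token_index: frequency})."""
    channel_id, *entries = line.strip().split(",")
    counts = {}
    for entry in entries:
        index, freq = entry.split(":")
        counts[int(index)] = int(freq)
    return channel_id, counts

# Illustrative values (not taken from the dataset); the zero count is dropped
line = format_countvec_line("UC_example_channel", {3: 2, 0: 1, 7: 0})
parsed = parse_countvec_line(line)
print(line)
print(parsed)
```

Reading a few lines of the output file back in this way is a quick check that every index refers to an entry in the vocabulary file.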


In [59]:
##### Step 5 DONE! TASK 2 DONE!!!!

<div class="alert alert-block alert-success">
    
## 7. Summary <a class="anchor" name="summary"></a>

- I have successfully extracted the English comments from the input file for those channels that have at least 15 English comments, and cleaned them of any emoji characters.

- I have performed text processing on these extracted and cleaned English comments to generate the unigram token list and the list of the top 200 meaningful bigram tokens.

- I have created a vocabulary from these two generated lists and also found the frequency of occurrence of these tokens in the comments of each channel.



---



<div class="alert alert-block alert-success">
    
## 8. References <a class="anchor" name="Ref"></a>

[1] Pandas Dataframe handling, https://www.geeksforgeeks.org/python-pandas-dataframe/

[2] Pandas dataframe.drop_duplicates(), https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/

[3] pandas.concat(), https://pandas.pydata.org/docs/reference/api/pandas.concat.html

[4] Collections library, https://docs.python.org/3/library/collections.html

[5] NLTK RegexpTokenizer, https://www.nltk.org/api/nltk.tokenize.regexp.html

[6] NLTK MWETokenizer, https://www.nltk.org/api/nltk.tokenize.mwe.html

[7] Bigram PMI Measuring, https://tedboy.github.io/nlps/generated/generated/nltk.BigramAssocMeasures.html

[8] Bigram Collocation Finding, https://tedboy.github.io/nlps/generated/generated/nltk.BigramAssocMeasures.html

[9] Detecting Language, https://www.geeksforgeeks.org/detect-an-unknown-language-using-python/

[10] Regex manipulation, functions and pattern matching, https://docs.python.org/3/howto/regex.html#

[11] PMI Concept, https://en.wikipedia.org/wiki/Pointwise_mutual_information

[12] CountVectorizer, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

                                                                            # ALL DONE
                                                                                🙂