
# Text Mining - Final Project
Vrije Universiteit Amsterdam, March 2023

Group 20 | Antoni Stroinski, Elza Učelniece, Ryan Ott (rot280), Youssef Baccouche



This notebook serves as the basis of our analysis into three core disciplines in Natural Language Processing (NLP):

1.   Named Entity Recognition/Classification (NERC)
2.   Sentiment Analysis
3.   Topic Analysis

The goal of this project is to apply the skills obtained during the Text Mining course to some common NLP tasks and to compare the performance of different systems on a given task.

From data collection and inspection, through model preparation and training, to analysis and discussion, this work should showcase our understanding of the complete modern NLP pipeline.

## Sentiment Analysis

### Data
To train our model we chose the [sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140) dataset from Kaggle. It is a collection of 1.6 million tweets extracted through the Twitter API and was used for a [Stanford paper on Twitter Sentiment Classification](https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf). Half of the tweets are of positive sentiment, the other half of negative sentiment.

It is important to note that this dataset does not contain "gold standard", human-generated labels; instead, it was labeled automatically based on emoticons. If a tweet's text contained a positive emoticon (like ":)" or ":-D"), the tweet was labeled as positive and the emoticon itself was removed, so that a model learns only the relationship between the written text and the sentiment. The same was done for the negative class. Tweets containing both positive and negative emoticons were excluded, since mixed emoticons suggest that the tweet addresses several topics with differing sentiments.
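The labeling rule described above can be sketched as follows. This is only a minimal illustration, not the dataset authors' code: the emoticon lists and the function name `label_by_emoticon` are assumptions made for this example, and the original paper may have used different emoticons.

```python
import re

# Assumed, illustrative emoticon lists; the dataset authors' exact lists may differ.
POSITIVE_EMOTICONS = [":)", ":-)", ":D", ":-D", "=)"]
NEGATIVE_EMOTICONS = [":(", ":-(", "=("]


def _pattern(emoticons):
    # Try longer emoticons first so ":-)" is matched before any shorter alternative.
    ordered = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))


POSITIVE_RE = _pattern(POSITIVE_EMOTICONS)
NEGATIVE_RE = _pattern(NEGATIVE_EMOTICONS)


def label_by_emoticon(tweet):
    """Return (label, text_without_emoticons), or None if the tweet gets no label."""
    has_pos = POSITIVE_RE.search(tweet) is not None
    has_neg = NEGATIVE_RE.search(tweet) is not None
    if has_pos == has_neg:
        # Either no emoticon at all or mixed emoticons: no usable label.
        return None
    label = "positive" if has_pos else "negative"
    # Strip the emoticons so a model cannot simply memorise them.
    text = NEGATIVE_RE.sub("", POSITIVE_RE.sub("", tweet))
    return label, " ".join(text.split())


print(label_by_emoticon("Finally passed my exam :)"))
print(label_by_emoticon("Missed the train again :-("))
print(label_by_emoticon("Great food :) but awful service :("))
```

The third call returns `None` because the tweet mixes positive and negative emoticons, which mirrors the exclusion rule described for the dataset.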

Only the "target" and "text" columns matter for our analysis. The target is the sentiment class: "positive" (stored as 4) or "negative" (stored as 0). The text is the tweet itself.

In [None]:
!pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import os
# Kaggle API credentials: fill in your own and never commit a real key to a shared notebook
os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"

In [4]:
!kaggle datasets download -d kazanova/sentiment140

Downloading sentiment140.zip to /content
 85% 69.0M/80.9M [00:00<00:00, 78.7MB/s]
100% 80.9M/80.9M [00:00<00:00, 90.2MB/s]


In [5]:
!unzip sentiment140.zip

Archive:  sentiment140.zip
  inflating: training.1600000.processed.noemoticon.csv  


In [45]:
import pandas as pd

COLUMN_NAMES = ["target", "ids", "date", "flag", "user", "text"]
TWEETS_FILE = "training.1600000.processed.noemoticon.csv"

twitter_df = pd.read_csv(TWEETS_FILE, names=COLUMN_NAMES, encoding='ISO-8859-1')
twitter_df.drop(["ids", "date", "flag", "user"], axis=1, inplace=True)  # removing unimportant columns for our analysis

twitter_df.target = twitter_df.target.apply(lambda x: "positive" if x == 4 else "negative")  # makes the target labels more human readable

# Showing the label distribution
value_counts = twitter_df['target'].value_counts()
print(f"Label Distribution:\nPos: {value_counts['positive']}\tNeg: {value_counts['negative']}\n")  # index by label, not by position, so the counts cannot be swapped
print("Negative examples:\n", twitter_df.head(3), "\n")
print("Positive examples:\n", twitter_df.tail(3))

Label Distribution:
Pos: 800000	Neg: 800000

Negative examples:
      target                                               text
0  negative  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1  negative  is upset that he can't update his Facebook by ...
2  negative  @Kenichan I dived many times for the ball. Man... 

Positive examples:
            target                                               text
1599997  positive  Are you ready for your MoJo Makeover? Ask me f...
1599998  positive  Happy 38th Birthday to my boo of alll time!!! ...
1599999  positive  happy #charitytuesday @theNSPCC @SparksCharity...


#### Pre-Processing


In [58]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

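nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))  # a set gives constant-time membership checks
stemmer = SnowballStemmer('english')  # defined here but not applied in preprocess() below

def preprocess(text):
  """Remove leftover emoticons, lowercase, strip special characters, tokenize and drop stop words."""
  text_without_emoticons = re.sub(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', '', text)  # remove some emoticons that were still mistakenly left in the dataset
  stripped = re.sub(r'[^\w\s.,]+', '', str(text_without_emoticons).lower()).strip()  # make lowercase and strip everything except word characters, whitespace, periods and commas

  # tokenize and keep only the tokens that are not stop words
  return [token for token in word_tokenize(stripped) if token not in stop_words]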

# text before and after pre-processing
print("Before: ", twitter_df.loc[420, 'text'])
twitter_df['tokens'] = twitter_df.text.apply(preprocess)  # apply() returns a new Series, so the result has to be stored
print("After: ", twitter_df.loc[420, 'tokens'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Before:  @SupaMagg that happened to me saturday night. along with my glittery green lighter! 


In [52]:
twitter_df.text[0]

"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

In [55]:
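def preprocess(text):
  # earlier version of preprocess, reconstructed from the output below: it returns the cleaned string without tokenizing
  text_without_emoticons = re.sub(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', '', text)
  stripped = re.sub(r'[^\w\s.,]+', '', str(text_without_emoticons).lower()).strip()
  return stripped

preprocess(twitter_df.text[0])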

'switchfoot httptwitpic.com2y1zl  awww, thats a bummer.  you shoulda got david carr of third day to do it.'