# Workshop 01

- Name: Ran Arino
- Student ID: 153073200
- Email: rarino@myseneca.ca
- Course: Social Media Analytics
- Course ID: BDA600NAA.07578.2241
- Professor: Dr. Pantea Koochemeshkian

In [33]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from sklearn.feature_extraction.text import TfidfVectorizer

### Part 1

In [2]:
# load data set
data = pd.read_csv("full-corpus-x.csv")
data.head()

Unnamed: 0,TweetText
0,Now all @Apple has to do is get swype on the i...
1,@Apple will be adding more carrier support to ...
2,Hilarious @youtube video - guy does a duet wit...
3,@RIM you made it too easy for me to switch to ...
4,I just realized that the reason I got into twi...


In [34]:
def clean_texts(raw_texts):
    """Lowercase each text and drop punctuation tokens and English stop words.

    raw_texts: a list or NumPy array of strings.
    """
    # set of English stop words
    stop_words = set(stopwords.words('english'))

    result = []
    # traverse all sentences
    for sent in raw_texts:
        # tokenize the sentence
        tokens = word_tokenize(sent)
        # keep lowercased tokens that are neither punctuation nor stop words
        # (no stemming is applied)
        kept = [
            w.lower() for w in tokens
            if w not in string.punctuation and w.lower() not in stop_words
        ]
        # join the kept tokens with single spaces
        result.append(' '.join(kept))

    return result

# get the clean tweet data as list
sent_list = clean_texts(data['TweetText'].values)
sent_list[:5]

['apple get swype iphone crack iphone',
 'apple adding carrier support iphone 4s announced',
 "hilarious youtube video guy duet apple 's siri pretty much sums love affair http //t.co/8exbnqjy",
 'rim made easy switch apple iphone see ya',
 'realized reason got twitter ios5 thanks apple']
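
The cleaned output above still contains link fragments such as `http //t.co/8exbnqjy`, because the tokenizer splits URLs into pieces. A minimal sketch, assuming links carry no useful signal for this analysis, removes them before tokenization. `strip_urls` is a hypothetical helper that is not part of the original notebook, and its pattern is a simple assumption rather than a complete URL grammar.

```python
import re

# Hypothetical helper (not in the original notebook).
# The pattern is a simple assumption: anything starting with http://,
# https://, or www. up to the next whitespace is treated as a link.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text: str) -> str:
    # replace each link with a space so neighbouring words stay separated
    text = URL_RE.sub(" ", text)
    # collapse repeated whitespace left behind by the removal
    return " ".join(text.split())

print(strip_urls("Hilarious video http://t.co/8exbnqjy"))
```

In the notebook, this could be applied to each tweet before calling `clean_texts`, for example `clean_texts([strip_urls(t) for t in data['TweetText'].values])`.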

In [42]:
# create TF-IDF vectorizer (max_features=50 keeps the 50 terms with the highest corpus frequency)
tfidf_vect = TfidfVectorizer(max_features=50)
# fit and transform data
matrix = tfidf_vect.fit_transform(sent_list)
# get the vocabulary terms (in the column order of the matrix)
words = tfidf_vect.get_feature_names_out()
# get the TF-IDF scores as a dense array (rows: tweets, columns: terms)
scores = matrix.toarray()
# pair each word with its TF-IDF score summed over all tweets
word_scores = list(zip(words, scores.sum(axis=0)))
# sort words by their summed scores, highest first
sorted_words = sorted(word_scores, key=lambda x: x[1], reverse=True)
# get the top 50 words (the whole vocabulary here, since max_features=50)
top50_words = [word[0] for word in sorted_words[:50]]
print(top50_words)

['twitter', 'microsoft', 'http', 'co', 'google', 'apple', 'rt', 'android', 'de', 'en', 'new', 'que', 'samsung', 'nexus', 'iphone', 'facebook', 'el', 'windows', 'get', 'galaxy', 'sandwich', 'via', 'phone', 'like', 'cream', 'ice', 'ics', 'la', 'siri', 'un', 'steve', 'ballmer', 'app', 'store', 'ios5', 'time', 'icecreamsandwich', 'es', 'nexusprime', 'para', 'one', '4s', 'se', 'great', 'cloud', 'galaxynexus', 'con', 'yahoo', 'video', 'por']
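
To make the ranking above concrete, the following is a minimal pure-Python sketch of the score being sorted: the sum over documents of each term's L2-normalized TF-IDF weight, using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that `TfidfVectorizer` applies by default. It assumes whitespace tokenization, which differs from scikit-learn's default token pattern, and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def summed_tfidf(docs):
    """Sum L2-normalized TF-IDF weights per term across all documents."""
    # assumption: whitespace tokenization (not scikit-learn's token pattern)
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    totals = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        # raw term count times idf
        weights = {t: c * idf[t] for t, c in tf.items()}
        # L2 norm of this document's weight vector
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            totals[t] += w / norm
    return totals

# toy documents, invented for illustration only
toy_docs = ["apple iphone siri", "apple iphone", "google android"]
totals = summed_tfidf(toy_docs)
for term, score in totals.most_common():
    print(term, round(score, 3))
```

Because each document vector is normalized before summing, a term gains most from appearing in many short documents, which is one reason frequent tokens such as "http" and "rt" rank highly.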


In [45]:
# Words with only one or two characters (e.g., 'rt', 'de', 'co') tend to carry little information.
# So, filter the ranked words by length and keep only words longer than two characters.

# sort words by scores + filter word length
sorted_words_2 = sorted([(w, s) for w, s in word_scores if len(w) > 2], key=lambda x: x[1], reverse=True)
# get up to 50 words; fewer remain here because the vocabulary was already capped at 50 before filtering
top50_words_2 = [word[0] for word in sorted_words_2[:50]]
print(top50_words_2)

['twitter', 'microsoft', 'http', 'google', 'apple', 'android', 'new', 'que', 'samsung', 'nexus', 'iphone', 'facebook', 'windows', 'get', 'galaxy', 'sandwich', 'via', 'phone', 'like', 'cream', 'ice', 'ics', 'siri', 'steve', 'ballmer', 'app', 'store', 'ios5', 'time', 'icecreamsandwich', 'nexusprime', 'para', 'one', 'great', 'cloud', 'galaxynexus', 'con', 'yahoo', 'video', 'por']


<b>Observations</b>

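- The tweets center on technology topics: major tech companies (e.g., Apple, Google, Microsoft, Samsung), smartphone products and platforms (e.g., iPhone, Android, Galaxy Nexus), and related terms such as "siri" and "ios5".
- Some words look unrelated to technology at first, including "ice", "cream", and "sandwich". They appear alongside "icecreamsandwich", "ics", "android", and "galaxynexus", which suggests that they refer to Android 4.0 "Ice Cream Sandwich"; inspecting the tweets that contain them would confirm this.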
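
One way to investigate how a word such as "sandwich" relates to the technology topics is to count the words that appear in the same cleaned tweets. The sketch below is a minimal example under stated assumptions: `cooccurring_words` is a hypothetical helper, and the toy sentences are invented for illustration rather than taken from the corpus.

```python
from collections import Counter

def cooccurring_words(sentences, target, top_n=5):
    """Count words that appear in the same cleaned tweet as `target`."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        if target in tokens:
            # count each co-occurring word once per tweet
            counts.update(t for t in set(tokens) if t != target)
    return counts.most_common(top_n)

# hypothetical cleaned tweets, invented for illustration only
toy_sentences = [
    "android ice cream sandwich galaxy nexus",
    "ice cream sandwich android update",
    "apple iphone siri",
]
print(cooccurring_words(toy_sentences, "sandwich"))
```

In the notebook, the same function could be called on the real data, for example `cooccurring_words(sent_list, "sandwich", top_n=10)`.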