## Twitter Analysis with Python and pandas
This is a rework of an existing sentiment analysis project. It has been modified for the sake of simplicity.

Author of original project: KROUDIR Amir

Github:
- Profile: https://github.com/kroudir
- Project: https://github.com/kroudir/Twitter-Sentiment-Analysis-with-python/blob/master/Project_notebook.ipynb


### 1) Data Access

Let’s load the libraries that will be used in this project.

In [None]:
import re    # for regular expressions 
import nltk  # for text manipulation 
import warnings 
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt  

pd.set_option("display.max_colwidth", 200) 
warnings.filterwarnings("ignore", category=DeprecationWarning) 

%matplotlib inline

Let’s read the dataset into a pandas data frame.

In [None]:
data  = pd.read_csv('TweetsElonMusk.csv') 


In [None]:
data.head(10)

### 2) Data Inspection


First, let’s check the dimensions of the dataset.

In [None]:
data.shape # returns the shape of the data frame as (number of rows, number of columns)


The dataset has 12,562 tweets and 36 attributes.

Let’s have a glance at the different attributes.

In [None]:
data.columns # gives us all column names

Let’s check out the text of some tweets, which should be in the "tweet" column.

In [None]:

data["tweet"].head(10)



In [None]:
top10 = data.sort_values(by="retweets_count",ascending=False).head(10)
top10["tweet"]

Now we will check the distribution of the length of the tweets, in terms of characters (which is what `str.len()` counts).

In [None]:
length_data = data['tweet'].str.len() 
plt.hist(length_data, bins=20, label="data_tweets") 
plt.legend()
plt.show()
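
Note that `str.len()` counts characters. If a distribution in words is wanted instead, one option is to split each tweet on whitespace first. The sketch below uses a few made-up tweets (not taken from the dataset) and plain Python; the pandas expression in the comment is the assumed equivalent for the `tweet` column.

```python
# Made-up sample tweets, for illustration only (not from TweetsElonMusk.csv).
sample_tweets = [
    "Starship launch today",
    "Thanks @user for the great ride!",
    "ok",
]

# Length in characters: what data['tweet'].str.len() measures.
char_lengths = [len(t) for t in sample_tweets]

# Length in words: split on whitespace, then count the pieces.
# Assumed pandas equivalent: data['tweet'].str.split().str.len()
word_lengths = [len(t.split()) for t in sample_tweets]

print(char_lengths)
print(word_lengths)
```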

### 3) Data Cleaning


In any natural language processing task, cleaning the raw text is an important step. It removes unwanted words and characters, which helps in obtaining better features; if we skip it, we are more likely to end up working with noisy and inconsistent data. The objective of this step is to strip out noise that is less relevant to the sentiment of a tweet, such as punctuation, special characters, numbers, and terms that carry little weight in the context of the text.

Given below is a user-defined function to remove unwanted text patterns from the tweets.

In [None]:
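def remove_pattern(input_txt, pattern):
    # Remove every match of the regular expression `pattern` from `input_txt`.
    # Substituting with the pattern itself avoids re-interpreting matched text
    # (which may contain regex metacharacters) as a new pattern.
    return re.sub(pattern, '', input_txt)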
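
A point worth keeping in mind with helpers of this kind: if matched text is passed back to `re.sub` as a pattern, any regex metacharacters in it are reinterpreted. The sketch below contrasts the two approaches using hypothetical helper names and a made-up string. For the `@` handle pattern used in this notebook both behave the same, because handles contain only word characters.

```python
import re

def remove_via_matches(text, pattern):
    # Hypothetical helper: feeds each matched string back into re.sub,
    # so the match itself is interpreted as a new regular expression.
    for match in re.findall(pattern, text):
        text = re.sub(match, "", text)
    return text

def remove_direct(text, pattern):
    # Hypothetical helper: substitutes using the original pattern.
    return re.sub(pattern, "", text)

text = "sum (a+b) here"       # made-up input
pattern = r"\(.*?\)"          # anything in parentheses

# "(a+b)" is re-read as a regex group, which does not match the literal text.
via_matches = remove_via_matches(text, pattern)
direct = remove_direct(text, pattern)

print(repr(via_matches))
print(repr(direct))
```

Wrapping the match in `re.escape` before substituting would be another way to avoid the reinterpretation.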

We will be following the steps below to clean the raw tweets in our data.

1. We will remove the Twitter handles (words starting with ‘@’). These handles hardly give any information about the nature of the tweet.

2. We will also get rid of punctuation, numbers, and special characters, since they wouldn’t help in differentiating different types of tweets.

3. Most short words, such as ‘pdx’, ‘his’, and ‘all’, do not add much value, so we will try to remove them from our data as well.

4. Lastly, we will normalize the text data. Terms like ‘loves’, ‘loving’, and ‘lovable’ are often used in the same context. If we reduce them to their root word, ‘love’, we reduce the total number of unique words in our data without losing a significant amount of information.

#### 1. Removing Twitter Handles (@user)

Let’s create a new column, tidy_tweet, which will contain the cleaned and processed tweets. Note that we pass “@[\w]*” as the pattern to the remove_pattern function. It is a regular expression that matches any word starting with ‘@’.

In [None]:
data['tidy_tweet'] = np.vectorize(remove_pattern)(data['tweet'], r"@[\w]*")  # raw string keeps \w intact
data.head(10)
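
To see what the handle pattern does in isolation, here is a minimal sketch on made-up strings using only the standard `re` module. It also shows a side effect to be aware of: the pattern removes the part after `@` in anything that looks like an email address.

```python
import re

# Made-up tweet, for illustration only.
sample = "@user1 thanks, see you at the launch @user_2!"
without_handles = re.sub(r"@\w*", "", sample)
print(repr(without_handles))

# Side effect: '@' followed by word characters is removed anywhere in the text.
email_like = re.sub(r"@\w*", "", "mail me at a@b.com")
print(repr(email_like))
```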

#### 2. Removing Punctuations, Numbers, and Special Characters

Here we will replace everything except letters and hashtags with spaces. The regular expression “[^a-zA-Z#]” matches any character that is not a letter or ‘#’.

In [None]:
data['tidy_tweet'] = data['tidy_tweet'].str.replace(r"[^a-zA-Z#]", " ", regex=True)  # regex=True: treat the pattern as a regular expression
data['tidy_tweet'].head(10)
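
Depending on the pandas version, `Series.str.replace` may treat the pattern as a literal string unless `regex=True` is passed, so it seems safest to state it explicitly. A minimal sketch on a made-up Series (the sample strings are assumptions, not taken from the dataset):

```python
import pandas as pd

# Made-up tweets, for illustration only.
tweets = pd.Series(["Launch at 10:30!! #SpaceX", "100% ready :)"])

# Every character that is not a letter or '#' becomes a single space,
# so the cleaned strings keep their original length.
cleaned = tweets.str.replace(r"[^a-zA-Z#]", " ", regex=True)

# Splitting on whitespace shows which tokens survive.
tokens = cleaned.str.split().tolist()
print(tokens)
```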



#### 3. Removing Short Words

We have to be a little careful here in selecting the length of the words which we want to remove. So, I have decided to remove all the words having length 3 or less. For example, terms like “hmm”, “oh” are of very little use. It is better to get rid of them.

In [None]:
data['tidy_tweet'] = data['tidy_tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

In [None]:
data.head()

You can see the difference between the raw tweets and the cleaned tweets (tidy_tweet) quite clearly. Only the important words in the tweets have been retained, and the noise (numbers, punctuation, and special characters) has been removed.
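
The short-word filter can be tried on single strings to get a feel for the chosen threshold. The sketch below uses made-up sentences. The second one illustrates why some care is needed: short words such as "not" and "bad" are dropped too, even though they can matter for sentiment.

```python
# Keep only words longer than 3 characters (same rule as above).
def drop_short_words(text, min_len=4):
    # min_len is a hypothetical parameter name introduced for this sketch.
    return " ".join(w for w in text.split() if len(w) >= min_len)

kept = drop_short_words("oh hmm this rocket launch was cool")
print(kept)

# Sentiment-bearing short words are removed as well.
lossy = drop_short_words("this is not bad")
print(lossy)
```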

#### 4. Text Normalization

Here we will use nltk’s PorterStemmer class to normalize the tweets. But before that we will have to tokenize the tweets. Tokens are individual terms or words, and tokenization is the process of splitting a string of text into tokens.

In [None]:
tokenized_tweet = data['tidy_tweet'].apply(lambda x: x.split()) # tokenizing 
tokenized_tweet.head()

Now we can normalize the tokenized tweets.

In [None]:
from nltk.stem.porter import PorterStemmer  # import only the class that is used
stemmer = PorterStemmer() 
# stemming
tokenized_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x]) 

In [None]:
tokenized_tweet.head()
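
To check how the stemmer treats individual words, it can be called directly on a small list. This is a minimal sketch assuming nltk is installed; the word list is made up. Stems are not guaranteed to be dictionary words, and not every variant necessarily maps to the same stem, so printing the mapping is a quick way to verify.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Made-up word list, for illustration only.
words = ["loves", "loving", "lovable"]

# Map each word to its stem so the results can be inspected side by side.
stems = {w: stemmer.stem(w) for w in words}
print(stems)
```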

Now let’s stitch these tokens back together. Joining the tokens of each tweet with a single space is enough here.

In [None]:
data['tidy_tweet'] = tokenized_tweet.apply(lambda x: ' '.join(x))


In [None]:
data.head()

### 4) Story Generation and Visualization of Tweets

In this section, we will explore the cleaned tweets. Exploring and visualizing data, whether it’s text or any other kind of data, is an essential step in gaining insights.

Before we begin exploration, we must think and ask questions related to the data in hand. A few probable questions are as follows:

- What are the most common words in the entire dataset?
- What are the most common words in the dataset for negative and positive tweets, respectively?
- How many hashtags are there in a tweet?
- Which trends are associated with my dataset?


#### Understanding the common words used in the tweets: WordCloud

Now I want to see how the words are distributed across the dataset. One way to do this is to identify the most common words by plotting wordclouds.

A wordcloud is a visualization wherein the most frequent words appear in large size and the less frequent words appear in smaller sizes.

Let’s visualize all the words in our data using a wordcloud plot.

In [None]:
from wordcloud import WordCloud 

In [None]:
all_words = ' '.join(data['tidy_tweet'])  # one long string containing every cleaned tweet
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words) 
plt.figure(figsize=(10, 7)) 
plt.imshow(wordcloud, interpolation="bilinear") 
plt.axis('off') 
plt.show()
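
A wordcloud sizes words by how often they occur, so the same ranking can also be inspected numerically. A minimal sketch using `collections.Counter` on made-up cleaned tweets (the sample strings are assumptions, standing in for the `tidy_tweet` column):

```python
from collections import Counter

# Made-up cleaned tweets, for illustration only.
cleaned_tweets = ["rocket launch today", "launch delay", "rocket land launch"]

# Same idea as for the wordcloud: one long string, then count the words.
all_words_sample = " ".join(cleaned_tweets)
word_counts = Counter(all_words_sample.split())

# The most frequent words are the ones a wordcloud would draw largest.
top_words = word_counts.most_common(2)
print(top_words)
```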

#### Understanding the impact of Hashtags on tweets sentiment

Hashtags on Twitter reflect the trends that are ongoing at any particular point in time. We should check whether these hashtags add any value to our sentiment analysis task.

In [None]:
# function to collect hashtags 
def hashtag_extract(x):    
    hashtags = []    
    # Loop over the tweets and collect the hashtags found in each one
    for i in x:        
        ht = re.findall(r"#(\w+)", i)        
        hashtags.append(ht)     
    
    return hashtags

In [None]:
# extracting hashtags
HT_regular = hashtag_extract(data['tidy_tweet']) 

# unnesting list 
HT_regular = sum(HT_regular,[]) 
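
The extraction and unnesting steps can be sketched on a handful of made-up tweets. Here `itertools.chain.from_iterable` is used for flattening as an alternative to `sum(..., [])`; it gives the same flat list without repeatedly copying intermediate lists.

```python
import re
from itertools import chain

# Made-up tweets, for illustration only.
tweets = ["great day #spacex #launch", "no tags here", "again #launch"]

# One list of hashtags per tweet (empty if the tweet has none).
per_tweet = [re.findall(r"#(\w+)", t) for t in tweets]

# Flatten the nested lists into a single list of hashtags.
flat = list(chain.from_iterable(per_tweet))

print(per_tweet)
print(flat)
```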


Now that we have prepared our list of hashtags, we can plot the top ‘n’ hashtags. So, first let’s check the most frequent ones.

In [None]:
a = nltk.FreqDist(HT_regular) 
d = pd.DataFrame({'Hashtag': list(a.keys()),'Count': list(a.values())}) 
# selecting the 15 most frequent hashtags
d = d.nlargest(columns="Count", n = 15) 
plt.figure(figsize=(30,5)) 
ax = sns.barplot(data=d, x= "Hashtag", y = "Count") 
ax.set(ylabel = 'Count') 
plt.show()

In the next step we can also plot a word cloud for the hashtags:

In [None]:

all_tags = ' '.join(HT_regular) # create a common string

wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=90).generate(all_tags) 
plt.figure(figsize=(10, 7)) 
plt.imshow(wordcloud, interpolation="bilinear") 
plt.axis('off') 
plt.show()