# Twitter User Analysis

This project aims to understand a user's behavior through their tweets and their interactions with other users on [Twitter](https://twitter.com), a popular social media platform. I reached out to a friend of mine, [Shriram Dusane](https://twitter.com/shriramdusane), and he consented to his data being used for this analysis.

I scraped the tweets with [twint](https://github.com/twintproject/twint), a tool hosted on GitHub, which gave me all the data I needed. Follow the link above to explore the tool. If you're short on patience, open your Anaconda prompt and run the following commands to extract a user's tweets with twint.
```
pip3 install twint
twint -u username -o file.json --json
```

Replace `username` with the user's handle, e.g. @shriramdusane or @xyz, and replace `file` with the name under which you want to save the tweets. The file is saved in the directory from which you run the Anaconda prompt.

After getting the tweets, we need to clean and organize the data for further analysis in Tableau. Let's get straight to it.

In [1]:
# Import all necessary libraries

# For dataframe-related operations
import numpy as np
import pandas as pd

# To read tweets from the json file
import json

# To view random slices of data from the dataframe
import random

# To generate a wordcloud out of hashtags
from wordcloud import WordCloud, STOPWORDS

# To read images
from PIL import Image

# For counting occurrences of hashtags
from collections import Counter

# For handling text and preprocessing it
import re
import regex
import emoji

# To suppress unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

# To view/display complete data inline instead of truncated data
# (None means "no limit"; newer pandas versions no longer accept -1 here)
pd.set_option("display.max_colwidth", None)

In [2]:
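# Each line of the file holds one tweet as a JSON object
path = './all_tweets.json'
x = []
with open(path, encoding='utf-8') as f:
    # x[0] is the list of raw lines; the later cells index into it
    x.append(f.readlines())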
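
The cell above keeps the raw lines because the file stores one JSON object per line, and the later cells parse them one at a time. As a minimal sketch of the same idea, the lines could also be parsed into a flat list of dictionaries in one step. The helper name `read_json_lines` and the two sample records are my own placeholders, not part of the notebook or of twint.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Parse a file holding one JSON object per line into a list of dicts."""
    tweets = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                tweets.append(json.loads(line))
    return tweets


# Tiny stand-in for all_tweets.json (made-up records, for illustration only)
sample_lines = [
    '{"id": 1, "tweet": "hello", "hashtags": []}',
    '{"id": 2, "tweet": "second tweet", "hashtags": ["#demo"]}',
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")

sample_tweets = read_json_lines(tmp.name)
os.remove(tmp.name)
print(len(sample_tweets), sample_tweets[1]["hashtags"])
```

With a flat list like this, a loop would iterate over `sample_tweets` directly rather than over `x[0]`.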

In [3]:
# View one tweet item from the json file
print(json.loads(x[0][1]))

{'id': 1220042301159833605, 'conversation_id': '1219976785598861312', 'created_at': 1579715726000, 'date': '2020-01-22', 'time': '23:25:26', 'timezone': 'India Standard Time', 'user_id': 1587976986, 'username': 'shriramdusane', 'name': 'Shriram Dusane', 'place': '', 'tweet': '@YugaBhat makes extremely delicious pasta 🍝', 'mentions': ['kushalvala', 'yugabhat'], 'urls': [], 'photos': [], 'replies_count': 1, 'retweets_count': 0, 'likes_count': 3, 'hashtags': [], 'cashtags': [], 'link': 'https://twitter.com/shriramdusane/status/1220042301159833605', 'retweet': False, 'quote_url': '', 'video': 0, 'near': '', 'geo': '', 'source': '', 'user_rt_id': '', 'user_rt': '', 'retweet_id': '', 'reply_to': [{'user_id': '1587976986', 'username': 'shriramdusane'}, {'user_id': '1012809324', 'username': 'kushalvala'}, {'user_id': '168484220', 'username': 'YugaBhat'}], 'retweet_date': '', 'translate': '', 'trans_src': '', 'trans_dest': ''}


In [4]:
# Read data from json elements and structure them to be converted into a dataframe
records = []
for i in x[0]:
    record = {}
    data = json.loads(i)
    record['Date'] = pd.to_datetime(data['date'] + ' ' + data['time'])
    record['TZ'] = data['timezone']
    # The links are already captured in 'urls', so remove them from the text
    record['Tweet'] = re.sub(r'https?://\S+', '', data['tweet'])
    # The dots are escaped so that they only match a literal '.'
    record['Tweet'] = re.sub(r'pic\.twitter\.com/\w+', '', record['Tweet'])

    # Both counts are computed on the raw tweet text, links included
    record['Word_count'] = len(data['tweet'].split())
    record['Character_count'] = len(data['tweet'])
    record['Mentions'] = data['mentions']
    record['Replies_Count'] = data['replies_count']
    record['Retweets_Count'] = data['retweets_count']
    record['Likes'] = data['likes_count']
    record['Hashtags'] = data['hashtags']
    record['Link'] = data['link']
    record['photos'] = data['photos']
    record['urls'] = data['urls']
    records.append(record)
data = pd.DataFrame(records)
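
To make the link-cleaning step easier to check in isolation, here is a small sketch that applies the same two substitutions to a made-up tweet. The function name `strip_links` and the sample text are hypothetical, and the final whitespace tidy-up is an extra step that the notebook itself does not perform.

```python
import re

# Same idea as the loop above: drop web links and pic.twitter.com links
URL_PATTERN = re.compile(r"https?://\S+")
PIC_PATTERN = re.compile(r"pic\.twitter\.com/\w+")


def strip_links(text):
    """Remove URLs and picture links, then collapse leftover whitespace."""
    text = URL_PATTERN.sub("", text)
    text = PIC_PATTERN.sub("", text)
    # Extra step (not in the notebook): squeeze the gaps the links leave behind
    return " ".join(text.split())


cleaned = strip_links(
    "Great read https://example.com/post pic.twitter.com/AbC123 today"
)
print(cleaned)  # Great read today
```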

In [5]:
# https://stackoverflow.com/questions/43146528/how-to-extract-all-the-emojis-from-text
# Function to extract emojis from text
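def split_count(text):
    emoji_list = []
    # \X matches one extended grapheme cluster, so an emoji with a
    # skin-tone modifier (e.g. 👌🏻) stays together as a single item
    data = regex.findall(r'\X', text)
    for word in data:
        # Assumes emoji >= 2.0, where EMOJI_DATA replaced the
        # UNICODE_EMOJI dictionary used by older releases
        if any(char in emoji.EMOJI_DATA for char in word):
            emoji_list.append(word)
    return emoji_list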

In [6]:
# Function to remove '#' sign from the hashtags
def remove_hash_sign(items):
    hashes = []
    for i in items:
        hashes.append(re.sub(r'#','', i))
    return hashes

In [7]:
# Apply the functions defined above to the dataframe
data['Emojis'] = data['Tweet'].apply(split_count)
data['Hashtags'] = data['Hashtags'].apply(remove_hash_sign)

In [8]:
# Count the number of emojis and create a field for the same
data['Emoji_Count'] = data.Emojis.apply(len)
# Save the data to a csv file
data.to_csv('Shrirams_Tweets.csv', header = True, index=False)

In [9]:
# Randomly pick 5 rows from the dataframe
# (random.choice samples with replacement, so a row may repeat)
elems = [random.choice(data.index) for _ in range(5)]
data.iloc[elems]

Unnamed: 0,Date,TZ,Tweet,Word_count,Character_count,Mentions,Replies_Count,Retweets_Count,Likes,Hashtags,Link,photos,urls,Emojis,Emoji_Count
1055,2019-04-30 22:17:31,India Standard Time,Thanks for eating a piece of cake when I offered.,10,49,"[anantikamehra, laldabba]",0,0,1,[],https://twitter.com/shriramdusane/status/1123267648303722496,[],[],[],0
7460,2014-06-23 22:27:08,India Standard Time,@brokntransistor yea man.. I always fuck up without the tuner 😐,11,63,[brokntransistor],1,0,0,[],https://twitter.com/shriramdusane/status/481118770430504960,[],[],[😐],1
761,2019-09-01 00:17:19,India Standard Time,Ae nahi yaar! Mazaa aa raha hai 👌🏻😂,8,35,[sassthree],0,0,0,[],https://twitter.com/shriramdusane/status/1167871504542945280,[],[],"[👌🏻, 😂]",2
3610,2017-08-21 21:32:54,India Standard Time,Thanks 😛,2,8,"[pskylarke, astikkulkarni, youtube]",0,0,0,[],https://twitter.com/shriramdusane/status/899663109006123008,[],[],[😛],1
2599,2018-01-28 00:08:10,India Standard Time,Your ability to miss the point is magnificently foregrounded in most of the discussions.,14,88,"[imanuraagsharma, factsionary]",1,0,1,[],https://twitter.com/shriramdusane/status/957321854359814144,[],[],[],0


In [10]:
# Function to create a wordcloud from the given image and text
def create_word_cloud(string, year, pic_loc, dest, transparency = 0.5):
    # Background on which to overlay the wordcloud
    background = np.array(Image.open(pic_loc))
    
    # Generation of wordcloud, specify color, maximum number of words, stopwords etc.
    cloud = WordCloud(background_color = "white", max_words = 200, mask = background, stopwords = set(STOPWORDS))
    cloud.generate(string)
    cloud.to_file(dest)
    
    # Read the recently created wordcloud image and make it ready for 
    # merging/overlaying
    overlay = Image.open(dest)
    background = Image.open(pic_loc).convert("RGBA")
    overlay = overlay.convert("RGBA")

    # Overlay the wordcloud on the background image and save it
    new_img = Image.blend(overlay, background, transparency)
    new_img.save(dest, "PNG")
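
The `transparency` argument is passed to `Image.blend` as its `alpha`, and Pillow documents the blend as `out = image1 * (1.0 - alpha) + image2 * alpha`. Because the word cloud is passed first, a higher `transparency` gives more weight to the background picture. The sketch below only reproduces that arithmetic for a single colour channel. `blend_channel` is a hypothetical helper, and it skips the conversion to integer pixel values that Pillow applies.

```python
def blend_channel(overlay_value, background_value, transparency):
    """Linear interpolation between two channel values (Pillow's blend formula)."""
    return overlay_value * (1.0 - transparency) + background_value * transparency


# A white word-cloud pixel (255) over a dark background pixel (50)
half = blend_channel(255, 50, 0.5)             # equal mix of both images
only_cloud = blend_channel(255, 50, 0.0)       # background invisible
only_background = blend_channel(255, 50, 1.0)  # word cloud invisible
print(half, only_cloud, only_background)       # 152.5 255.0 50.0
```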

In [11]:
# Get a list of all years in order to iterate over
years = data.Date.dt.year.unique()

In [12]:
# Make a hashtag word cloud for every year
all_hashes = {}
for year in years:
    # Collect the hashtags of every tweet posted in this year
    yearly_tags = data.loc[data['Date'].dt.year == year, 'Hashtags']
    hashtags = [tag for tags in yearly_tags for tag in tags]
    all_hashes[year] = " ".join(hashtags)
    # Expects a background image named shriram_<year>.jpg in the working directory
    create_word_cloud(all_hashes[year], year, f"shriram_{year}.jpg", f"{year}_Hashcloud.png")

In [13]:
# Extract and save all the hashtags yearwise in a csv file
yearwise_hashtags = {}
for year in years:
    c = Counter()
    for idx in data.index:
        if data.iloc[idx, 0].year == year:
            # Select the column by name: position 13 holds 'Emojis', not 'Hashtags'
            for i in data.loc[idx, 'Hashtags']:
                c[i] += 1
    yearwise_hashtags[year] = c
    
records = []

for year in years:
    for k, v in dict(yearwise_hashtags[year]).items():
        records.append([k, v, year])
hashes = pd.DataFrame(records, columns = ['Hashtag', 'Count', 'Year'])       
hashes.to_csv("Yearwise_Hashtags.csv", header = True, index = False)

In [14]:
# Extract and save all the mentions to a file (yearwise)
yearwise_mentions = {}
for year in years:
    c = Counter()
    for idx in data.index:
        if data.iloc[idx, 0].year == year:
            for i in data.iloc[idx,5]:
                c[i] += 1
    yearwise_mentions[year] = c
    
records = []

for year in years:
    for k, v in dict(yearwise_mentions[year]).items():
        records.append([k, v, year, 'https://twitter.com/' + k])
mentions = pd.DataFrame(records, columns = ['Mention', 'Count', 'Year', 'User URL'])
mentions.to_csv("Yearwise_Mentions.csv", header = True, index = False)
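
The yearwise loops above scan the whole dataframe once per year. As a hedged alternative, the same counts can be gathered in a single pass by keeping one `Counter` per year. The sketch below uses made-up `(year, mentions)` pairs in place of the `Date` and `Mentions` columns, and the names `rows`, `yearwise` and `mention_records` are illustrative only.

```python
from collections import Counter, defaultdict

# Made-up (year, mentions) pairs standing in for the Date and Mentions columns
rows = [
    (2019, ["alice", "bob"]),
    (2019, ["alice"]),
    (2020, ["carol"]),
    (2020, []),
]

# One Counter per year, filled in a single pass over the rows
yearwise = defaultdict(Counter)
for year, names in rows:
    yearwise[year].update(names)

# Flatten into [mention, count, year] records, ready for a dataframe
mention_records = [
    [name, count, year]
    for year, counter in sorted(yearwise.items())
    for name, count in counter.items()
]
print(mention_records)
# [['alice', 2, 2019], ['bob', 1, 2019], ['carol', 1, 2020]]
```

The same pattern would apply to hashtags by swapping in the `Hashtags` column.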