# Background

1. Provide a problem statement or a user story. (Who is your audience in the statement or story?)


Problem statement: Imagine we are a data analytics service provider. Our client, a public health organization, wants to understand and categorize COVID-19-related tweets in order to gain insight into public sentiment, misinformation, and key topics of discussion. These insights will help them tailor their communication strategies and public health campaigns more effectively.





2. Provide intuitive explanations of ML methodology and interpretation of key metrics.


ML methodology explanation: We combine unsupervised clustering with supervised multilabel classification to sort tweets into meaningful topics. First, we clean the text and convert it into numerical embeddings. Next, we apply K-Means clustering to group similar tweets. For each cluster, we generate representative keyword tags using GPT-3. We then train several multilabel classifiers (logistic regression, LDA, gradient boosting, random forest, and MLP), using these keyword tags as ground-truth labels. To evaluate the models we use accuracy and Hamming loss: accuracy is the share of tweets whose predicted tags are correct, and Hamming loss is the share of individual tag assignments that are wrong, so higher accuracy and lower Hamming loss are better. We select the model with relatively high accuracy and relatively low Hamming loss.
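
To make the two metrics concrete, here is a minimal sketch that computes them by hand on a toy example. The tag names and label matrices are made up for illustration and are not taken from our data. Accuracy is treated here as subset accuracy (a tweet counts as correct only if every tag matches), which is an assumption about how accuracy is computed for multilabel output.

```python
# Toy multilabel example: rows are tweets, columns are hypothetical tags.
TAGS = ["vaccine", "lockdown", "misinformation"]

y_true = [
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]
y_pred = [
    [1, 0, 0],  # all three tags correct
    [0, 1, 0],  # missed "misinformation"
    [1, 1, 0],  # all three tags correct
    [1, 0, 1],  # wrongly added "vaccine"
]

def subset_accuracy(y_true, y_pred):
    """Fraction of tweets whose full tag set is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual tag decisions that are wrong."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    total = len(y_true) * len(y_true[0])
    return wrong / total

acc = subset_accuracy(y_true, y_pred)
ham = hamming_loss(y_true, y_pred)
print(f"subset accuracy = {acc:.2f}")
print(f"Hamming loss = {ham:.3f}")
```

Two of the four toy tweets are tagged exactly right, so subset accuracy is 0.50, while only 2 of the 12 individual tag decisions are wrong, so Hamming loss is about 0.167. Accuracy in this sense is strict, whereas Hamming loss gives partial credit, which is why looking at both is useful.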



3. Model metrics/performance is connected with real-world impact (e.g. profit, retention…)



Real-world impact (qualitative in my case): A model that tags tweets reliably can help the public health organization identify trends, misinformation, and areas of public concern, allowing it to make informed decisions about communication strategies and public health campaigns. This can ultimately lead to increased public awareness, better adherence to safety guidelines, and improved public health outcomes.


In [52]:
import re
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, RegexpTokenizer
from wordcloud import WordCloud

# Uncomment on first run to download the NLTK resources
#nltk.download('stopwords')
#nltk.download('vader_lexicon')




# Preprocessing Helper Functions
def preprocess_text(text: str) -> str:
    
    """
    Cleans a tweet string: lowercases it, strips URLs, emojis, and other
    non-ASCII characters, and expands common English contractions
    Args: 
        - text (str): the text to clean
    Returns: 
        - clean_text (str): the cleaned text string
    """
    # convert to lowercase
    text = text.lower()
    
    # remove URLs
    text = re.sub(r"http\S+", "", text)

    # Remove emojis and any other non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    
    # Remove emoji shortcodes such as :smile:
    text = re.sub(r':\w+:', '', text)
    
    # Expand contractions (specific cases first, then the general patterns)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"won\'t", "will not", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    # Note: this also rewrites possessives, e.g. "trump's" becomes "trump is"
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    
    clean_text = text
    
    return clean_text

def preprocess_nulls(df: pd.DataFrame) -> pd.DataFrame: 
    
    """
    Drops duplicate tweets, rows from users with no followers, and rows
    containing nulls
    Args: 
        - df (pd.DataFrame): the dataframe to clean
    Returns: 
        - clean_df (pd.DataFrame): the cleaned df
    """
    
    # Drop rows with duplicate tweet text
    df = df.drop_duplicates(subset = "text")
    
    # Drop rows with no followers 
    df = df[df['user_followers'] > 0]
    
    # Drop rows with a null in any column and reset index 
    df = df.dropna().reset_index(drop = True)
    
    clean_df = df
    
    return clean_df

def preprocess_df(df: pd.DataFrame) -> pd.DataFrame:
    
    """
    Main processing function on the dataframe
    Args: 
        - df (pd.DataFrame): df of tweets to process
    Returns: 
        - preprocessed_df (pd.DataFrame): the processed df
    """
    
    # Preprocess null and missing values 
    df = preprocess_nulls(df)
    
    # Preprocess text 
    df['processed_text'] = df['text'].apply(preprocess_text)            
    
    return df
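
As a quick check of what the cleaning steps do, the sketch below applies the same kinds of regular expressions to one made-up tweet. `clean_tweet`, `CONTRACTIONS`, and the sample text are illustrative names, not part of the notebook. Two simplifications are assumptions of this sketch: the contractions live in a short ordered list (the ambiguous `'s` and `'d` rules are left out), and leftover whitespace is collapsed at the end.

```python
import re

# Ordered (pattern, replacement) pairs: specific contractions come first
# so the general "n't" rule does not turn "can't" into "ca not".
CONTRACTIONS = [
    (r"can't", "can not"),
    (r"won't", "will not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def clean_tweet(text: str) -> str:
    """Lowercase, strip URLs and non-ASCII characters, expand contractions."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)        # remove URLs
    text = re.sub(r"[^\x00-\x7F]+", "", text)  # remove emojis and other non-ASCII
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

sample = "We CAN'T relax yet 😷 they're reporting new #COVID19 cases https://t.co/abc"
print(clean_tweet(sample))
```

One caveat worth knowing: a typographic apostrophe (’) is a non-ASCII character, so it is stripped before the contraction rules run, and "can’t" ends up as "cant" rather than "can not".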

# EDA Dashboard

In [64]:
preprocessed_df.head(5)

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet,processed_text
0,Time4fisticuffs,"Pewee Valley, KY",#Christian #Catholic #Conservative #Reagan #Re...,2009-02-28 18:57:41,9275,9525,7254,False,2020-07-25 12:27:14,@diane3443 @wdunlap @realDonaldTrump Trump nev...,['COVID19'],Twitter for Android,False,@diane3443 @wdunlap @realdonaldtrump trump nev...
1,ethel mertz,Stuck in the Middle,#Browns #Indians #ClevelandProud #[]_[] #Cavs ...,2019-03-07 01:45:06,197,987,1488,False,2020-07-25 12:27:10,@brookbanktv The one gift #COVID19 has give me...,['COVID19'],Twitter for iPhone,False,@brookbanktv the one gift #covid19 has give me...
2,DIPR-J&K,Jammu and Kashmir,🖊️Official Twitter handle of Department of Inf...,2017-02-12 06:45:15,101009,168,101,False,2020-07-25 12:27:08,25 July : Media Bulletin on Novel #CoronaVirus...,"['CoronaVirusUpdates', 'COVID19']",Twitter for Android,False,25 july : media bulletin on novel #coronavirus...
3,🎹 Franz Schubert,Новоро́ссия,🎼 #Новоро́ссия #Novorossiya #оставайсядома #S...,2018-03-19 16:29:52,1180,1071,1287,False,2020-07-25 12:27:06,#coronavirus #covid19 deaths continue to rise....,"['coronavirus', 'covid19']",Twitter Web App,False,#coronavirus #covid19 deaths continue to rise....
4,hr bartender,"Gainesville, FL",Workplace tips and advice served up in a frien...,2008-08-12 18:19:49,79956,54810,3801,False,2020-07-25 12:27:03,How #COVID19 Will Change Work in General (and ...,"['COVID19', 'Recruiting']",Buffer,False,how #covid19 will change work in general (and ...


In [68]:
import ipywidgets as widgets
from IPython.display import display
import pandas as pd
import re
from wordcloud import WordCloud
from collections import Counter


file_path = input("Please enter the path to your CSV file: ")
df = pd.read_csv(file_path)

print("\nData is loaded successfully")

# Preprocess dataset
preprocessed_df = preprocess_df(df)
print("Your data is ready for analysis.")


# Dropdown menu to choose the plot
plot_options = ['Bar Plot of Most Common Words in Tweets', 'Distribution of Length of Tweets', 'Time-series Plot of Tweet Counts', 'Word Cloud of Most Common Words']
plot_dropdown = widgets.Dropdown(
    options=plot_options,
    value=plot_options[0],
    description='Select Plot:',
)

# Dropdown menu to choose the country
country_options = ['All Countries', 'United States', 'Canada', 'South Africa','Switzerland','London','India','United Kingdom']
country_dropdown = widgets.Dropdown(
    options=country_options,
    value=country_options[0],
    description='Select Country:',
)

# Date range picker
start_date_picker = widgets.DatePicker(
    description='Start Date',
    disabled=False
)

end_date_picker = widgets.DatePicker(
    description='End Date',
    disabled=False
)


# Button to process the dataset and generate the plot
process_button = widgets.Button(
    description='Plot',
    tooltip='Plot',
)

# Output widget to display the result
output = widgets.Output()


def on_button_click(b):
    with output:
        output.clear_output()

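        # Filter tweets by country (exact match on the free-text user_location field)
        # Work on a copy so the columns added below do not modify preprocessed_df
        selected_country = country_dropdown.value
        if selected_country != 'All Countries':
            filtered_df = preprocessed_df.loc[preprocessed_df['user_location'] == selected_country].copy()
        else:
            filtered_df = preprocessed_df.copy()

        # Parse dates once so the date filter and the time-series plot can use them
        filtered_df['date'] = pd.to_datetime(filtered_df['date'])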

        start_date = start_date_picker.value
        end_date = end_date_picker.value
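        if start_date and end_date:
            # DatePicker returns datetime.date values; extend the upper bound by
            # one day so tweets posted during the end date are included
            start_date = pd.to_datetime(start_date)
            end_date = pd.to_datetime(end_date) + pd.Timedelta(days=1)
            filtered_df = filtered_df[(filtered_df['date'] >= start_date) & (filtered_df['date'] < end_date)]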
        
        # Plot selected graph
        selected_plot = plot_dropdown.value

        if selected_plot == 'Bar Plot of Most Common Words in Tweets':
            # code for bar plot
            text = " ".join(filtered_df['processed_text'])
            words = text.split()
            words_counter = Counter(words)
            most_common_words = words_counter.most_common(20)

            words = [word[0] for word in most_common_words]
            counts = [word[1] for word in most_common_words]

            plt.bar(words, counts)
            plt.xlabel('Words')
            plt.ylabel('Frequency')
            plt.title('Bar Plot of Most Common Words in Tweets')
            plt.xticks(rotation=90)
            plt.show()

        elif selected_plot == 'Distribution of Length of Tweets':
            # code for length distribution plot
            filtered_df['text_length'] = filtered_df['text'].apply(len)
            filtered_df['text_length'].plot.hist(bins=30, rwidth=0.9)
            plt.xlabel('Length of Tweets')
            plt.ylabel('Counts')
            plt.title('Distribution of Length of Tweets')
            plt.show()
            
        elif selected_plot == 'Time-series Plot of Tweet Counts':
            # code for time-series plot: count tweets per calendar day
            # ('date' was already converted to datetime above)
            df_grouped = filtered_df.groupby(filtered_df['date'].dt.date).count()
            fig, ax = plt.subplots()
            ax.plot(df_grouped.index, df_grouped['text'])
            ax.set_xlabel('Date')
            ax.set_ylabel('Number of Tweets')
            ax.set_title('Time-series Plot of Tweet Counts')
            plt.xticks(rotation=90)
            plt.show()
            
        elif selected_plot == 'Word Cloud of Most Common Words':
            # code for word cloud
            text = " ".join(filtered_df['processed_text'])
            words = text.split()
            words_counter = Counter(words)
            wordcloud = WordCloud(width=800, height=400).generate_from_frequencies(words_counter)
            plt.figure(figsize=(10, 5))
            plt.imshow(wordcloud, interpolation='bilinear')
            plt.axis("off")
            plt.title('Word Cloud of Most Common Words')
            plt.show()
            
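        elif selected_plot == 'Word Cloud of Most Common Words by Location':
            # Note: this option is not listed in plot_options, so the branch is
            # unreachable unless it is added to the dropdown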
            top_words_by_location = {}
            for location in filtered_df['user_location'].unique():
                location_df = filtered_df.loc[filtered_df['user_location'] == location]
                text = " ".join(location_df['processed_text'])
                words = text.split()
                words_counter = Counter(words)
                most_common_words = words_counter.most_common(20)
                top_words_by_location[location] = most_common_words

            # Plot word cloud for each location
            for location, top_words in top_words_by_location.items():
                words = [word[0] for word in top_words]
                frequencies = [word[1] for word in top_words]
                wordcloud = WordCloud(width=800, height=400).generate_from_frequencies(dict(zip(words, frequencies)))
                plt.figure(figsize=(10, 5))
                plt.imshow(wordcloud, interpolation='bilinear')
                plt.axis("off")
                plt.title(f'Most Common Words in Tweets from {location}')
                plt.show()

process_button.on_click(on_button_click)

#Display widgets

display(country_dropdown)
display(start_date_picker)
display(end_date_picker)
display(plot_dropdown)
display(process_button)
display(output)


Please enter the path to your CSV file: covid2020.csv

Data is loaded successfully
Your data is ready for analysis.


Dropdown(description='Select Country:', options=('All Countries', 'United States', 'Canada', 'South Africa', '…

DatePicker(value=None, description='Start Date')

DatePicker(value=None, description='End Date')

Dropdown(description='Select Plot:', options=('Bar Plot of Most Common Words in Tweets', 'Distribution of Leng…

Button(description='Plot', style=ButtonStyle(), tooltip='Plot')

Output()
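
One thing to keep in mind when reading the bar plot and the word cloud: the counts are taken over every whitespace-separated token, so very common words such as "the" and "to" are likely to dominate. A minimal sketch of filtering them out before counting is below. The tiny `STOPWORDS` set, the `top_words` helper, and the sample tweets are placeholders for illustration; in practice the NLTK English stopword list imported at the top of the notebook could play that role.

```python
from collections import Counter
import re

# Placeholder stopword set for illustration only (hypothetical, not exhaustive).
STOPWORDS = {"the", "to", "of", "in", "is", "a", "and", "are"}

def top_words(texts, n=2):
    """Count tokens across texts, skipping stopwords, and return the n most common."""
    counter = Counter()
    for text in texts:
        # Keep word characters and '#' so hashtags stay intact; drop other punctuation.
        tokens = re.findall(r"[#\w]+", text.lower())
        counter.update(t for t in tokens if t not in STOPWORDS)
    return counter.most_common(n)

tweets = [
    "the cases are rising in the city",
    "cases of #covid19 in the news",
    "#covid19 cases and the vaccine",
]
print(top_words(tweets))
```

Without the filter, "the" would be the most frequent token in these three sample tweets; with it, the top entries are the content words "cases" and "#covid19".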