### Business Overview

The project develops an AI-driven detection system for tracking influenza, an acute viral infection that causes annual epidemics with severe respiratory symptoms, and tuberculosis, a bacterial infection that remains a major public health threat, especially in developing nations. Using social media data, the system aims to identify outbreaks early, enabling a faster response from the WHO and health agencies to reduce disease spread and mortality.

### Problem Statement

Early detection of influenza and tuberculosis outbreaks is critical, but official reporting often lags behind the initial spread. Social media data could provide an early warning system, but requires AI-powered analysis to identify and track disease trends before they become full-blown epidemics.

### Objectives

1. Detect potential influenza and tuberculosis outbreaks early using real-time social media data.
2. Track the spread patterns of these diseases by monitoring symptom-related keywords and geospatial data from social posts.
3. Identify high-risk areas for outbreaks before they are officially reported.
4. Develop predictive models to forecast the trajectory of potential outbreaks.
5. Provide early alerts to public health organizations and government agencies to enable faster response and intervention.

### Proposed Solutions

1. Collect real-time social media data (from platforms such as Twitter and Reddit) using APIs and keyword-based filtering.
2. Apply natural language processing (NLP) techniques to detect mentions of disease symptoms, concerns, and outbreak-related keywords.
3. Conduct sentiment analysis to identify posts indicating fear, panic or growing anxiety around potential outbreaks.
4. Leverage machine learning models like SVMs and neural networks to classify social posts as related to disease outbreaks or not.
5. Utilize anomaly detection algorithms to identify unusual spikes in outbreak-related keywords and phrases.
6. Map the geospatial and temporal data to visualize disease spread patterns and high-risk clusters.
7. Validate social media-derived insights against official health reports to ensure accuracy.
8. Develop predictive models to forecast outbreak trajectories and build an automated alert system for public health authorities.
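
Step 5 above (anomaly detection on keyword volume) could be sketched as a rolling z-score over daily counts of outbreak-related posts. This is only an illustration under stated assumptions: the function name `flag_spikes`, the 7-day window, the threshold of 3 standard deviations, and the sample counts are all hypothetical and are not taken from the project's data.

```python
import pandas as pd


def flag_spikes(daily_counts: pd.Series, window: int = 7, threshold: float = 3.0) -> pd.Series:
    """Return a boolean Series marking days whose count is an unusual spike."""
    # The baseline uses only the preceding `window` days (shift(1)),
    # so a spike does not inflate its own mean and standard deviation.
    baseline = daily_counts.shift(1).rolling(window, min_periods=window)
    mean = baseline.mean()
    std = baseline.std()
    # A zero standard deviation becomes NaN, which never exceeds the threshold.
    z_scores = (daily_counts - mean) / std.where(std > 0)
    return z_scores > threshold


# Hypothetical daily counts of posts that mention flu symptoms.
counts = pd.Series(
    [10, 12, 11, 9, 10, 13, 11, 12, 10, 48, 11],
    index=pd.date_range("2024-10-01", periods=11, freq="D"),
)
spikes = flag_spikes(counts)
print(counts[spikes])
```

Days without a full baseline window are never flagged, which avoids false alarms at the start of a series at the cost of a short warm-up period.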

### Metrics of Success

- F1 score of the disease outbreak classification model: Target F1 score > 0.85
- Precision of the outbreak detection: Target precision > 0.90
- AUC-ROC (area under the receiver operating characteristic curve): Target AUC-ROC > 0.90
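
To make the three classification targets concrete, the sketch below computes precision, F1, and AUC-ROC from first principles on a small made-up set of labels and scores. The helper names, the 0.5 decision threshold, and the sample data are assumptions for illustration; in practice scikit-learn's `precision_score`, `f1_score`, and `roc_auc_score` provide the same quantities.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = outbreak-related)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Each positive/negative pair counts 1 for a win and 0.5 for a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores for ten posts.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.40, 0.60, 0.30, 0.20, 0.15, 0.10, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"precision={precision:.2f} f1={f1:.2f} auc={auc:.3f}")
print("Meets targets:", precision > 0.90 and f1 > 0.85 and auc > 0.90)
```

Precision and F1 depend on the chosen threshold, while AUC-ROC does not, so a model can meet the AUC target and still miss the other two until the threshold is tuned.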


**Accuracy of Outbreak Prediction:**

- Accuracy of forecasting models in predicting outbreak magnitude: Target accuracy > 80%
- Accuracy of forecasting models in predicting outbreak trajectory: Target accuracy > 75%


**Correlation with Official Data:**

Correlation coefficient between social media-derived disease trends and official case data: Target r > 0.80


**Timeliness of Outbreak Detection:**

Average lead time between social media detection and official reporting: Target lead time > 7 days
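
One possible way to measure the correlation and lead-time targets together is to shift the social media series by a range of lags and keep the lag with the highest Pearson correlation against official case counts. The sketch below assumes daily series on a shared date index; the function name `estimate_lead_time`, the 14-day search range, and both synthetic series are hypothetical.

```python
import pandas as pd


def estimate_lead_time(social: pd.Series, official: pd.Series, max_lag_days: int = 14):
    """Return (lag_days, r) for the shift that maximises Pearson r."""
    best_lag, best_r = None, float("-inf")
    for lag in range(max_lag_days + 1):
        # shift(lag) moves the social signal `lag` days later, so post volume
        # on day d is compared with official cases on day d + lag.
        r = social.shift(lag).corr(official)
        if pd.notna(r) and r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r


# Synthetic example: official cases repeat the social signal 8 days later.
days = pd.date_range("2024-01-01", periods=30, freq="D")
social = pd.Series(
    [2, 3, 2, 4, 3, 5, 8, 14, 25, 40, 55, 60, 50, 35, 22,
     14, 9, 6, 4, 3, 3, 2, 3, 2, 2, 3, 2, 2, 3, 2],
    index=days,
    dtype=float,
)
official = social.shift(8).fillna(2.0)

lag, r = estimate_lead_time(social, official)
print(f"Estimated lead time: {lag} days (r = {r:.2f})")
print("Meets targets:", lag > 7 and r > 0.80)
```

Real signals are noisier than this synthetic pair, so the best lag is usually worth reporting alongside the correlation at neighbouring lags rather than as a single number.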

### Stakeholders

1. World Health Organization (WHO)
Global health agency responsible for coordinating pandemic preparedness and response

2. National/Regional Public Health Organizations
Disease control centers, epidemiology departments in countries/regions

3. Emergency Response Agencies
Disaster management authorities, emergency medical teams

4. Government Policymakers
Health ministers, legislators responsible for public health policies

5. Healthcare Providers
Hospitals, clinics, and other medical facilities that need early warning

6. Public Health Researchers and Epidemiologists
Academics and analysts studying disease trends and mitigation strategies

7. The General Public
Citizen stakeholders who benefit from faster outbreak response and containment

In [8]:
!pip install tweepy
!pip install textblob
!pip install wordcloud
!pip install emoji

Collecting emoji
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.0


### Data Understanding

#### Import necessary Libraries

In [38]:
# Data manipulation
import os
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from wordcloud import WordCloud


# nltk
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer


# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

import emoji
from collections import Counter
from textblob import TextBlob



# !pip install emoji

import warnings

# Ignore all warnings
warnings.filterwarnings('ignore')

# sklearn
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


#### Combine the scraped CSV files into one CSV file

In [34]:
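from pathlib import Path

import pandas as pd


def combine_csv_files(folder_path, id_column='id', output_file=None):
    """
    Combine all CSV files in a folder by stacking them vertically.

    If `id_column` exists, it is renumbered from 1 to N. If `output_file`
    is given, the combined data is also written there as CSV.
    Returns the combined DataFrame.
    """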
    # Convert folder path to Path object
    folder = Path(folder_path)

    # Get all CSV files in the folder
    csv_files = list(folder.glob('*.csv'))

    if not csv_files:
        raise ValueError(f"No CSV files found in {folder_path}")

    # List to store all dataframes
    dfs = []

    # Read and combine all CSV files
    for file in csv_files:
        try:
            # Try UTF-8 first; the except branch below falls back to latin1
            df = pd.read_csv(file, encoding='utf-8')
            print(f"Successfully read: {file.name}")

            # Add source file column if you want to track which file data came from
            df['source_file'] = file.name

            dfs.append(df)

        except UnicodeDecodeError:
            # Try different encoding if utf-8 fails
            try:
                df = pd.read_csv(file, encoding='latin1')
                print(f"Successfully read (with latin1 encoding): {file.name}")
                df['source_file'] = file.name
                dfs.append(df)
            except Exception as e:
                print(f"Error reading {file.name}: {str(e)}")
                continue
        except Exception as e:
            print(f"Error reading {file.name}: {str(e)}")
            continue

    if not dfs:
        raise ValueError("No CSV files were successfully read")

    # Combine all dataframes
    combined_df = pd.concat(dfs, axis=0, ignore_index=True)

    # Adjust ID column if it exists
    if id_column in combined_df.columns:
        combined_df[id_column] = range(1, len(combined_df) + 1)

    # Add timestamp for when the combination was done
    combined_df['combined_timestamp'] = pd.Timestamp.now()

    # Save combined DataFrame if output path is provided
    if output_file:
        combined_df.to_csv(output_file, index=False)
        print(f"\nCombined data saved to: {output_file}")

    # Print summary
    print(f"\nSummary:")
    print(f"Total files processed: {len(csv_files)}")
    print(f"Total rows in combined dataset: {len(combined_df)}")
    print(f"Columns in combined dataset: {', '.join(combined_df.columns)}")

    return combined_df


def main():
    try:
        # Data folder path
        folder_path = "/content/data"
        # Where to save the combined data as a CSV file (placeholder path)
        output_path = "path/to/save/combined_data.csv"

        # Combine files; pass output_file=output_path to also write the result to disk
        combined_data = combine_csv_files(
            folder_path=folder_path,
            id_column='id',
        )


        print("\nData Quality Check:")
        print(f"Null values:\n{combined_data.isnull().sum()}")
        print(f"\nDuplicate rows: {combined_data.duplicated().sum()}")

        # Optional: Display first few rows
        print("\nFirst few rows of combined data:")
        print(combined_data.head())

    except Exception as e:
        print(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    main()

Successfully read: TwiBot-#badflu-20241029_004417.csv
Successfully read: TwiBot-#CDCFlu-20241028_141507.csv

Summary:
Total files processed: 2
Total rows in combined dataset: 200
Columns in combined dataset: id, tweetText, tweetURL, type, tweetAuthor, handle, geo, mentions, hashtags, replyCount, quoteCount, retweetCount, likeCount, views, bookmarkCount, createdAt, allMediaURL, videoURL, source_file, combined_timestamp

Data Quality Check:
Null values:
id                      0
tweetText               0
tweetURL                0
type                    0
tweetAuthor             0
handle                  0
geo                    20
mentions              123
hashtags                0
replyCount              0
quoteCount              0
retweetCount            0
likeCount               0
views                   0
bookmarkCount           0
createdAt               0
allMediaURL           120
videoURL              195
source_file             0
combined_timestamp      0
dtype: int64

Duplicate 

In [35]:
class DataUnderstanding():
    """Class that provides an understanding of a dataset"""

    def __init__(self, data=None):
        """Initialization"""
        self.df = data

    def load_data(self, path):
        """Load the data"""
        if self.df is None:
            # Read as UTF-8; latin-1 garbles emoji and curly quotes in tweet text
            self.df = pd.read_csv(path, encoding='utf-8')
        return self.df

    def concat_data(self, other_df):
        """Concatenate the current dataframe with another dataframe vertically"""
        if self.df is not None and other_df is not None:
            self.df = pd.concat([self.df, other_df], axis=0, ignore_index=True)
        return self.df

    def understanding(self):
        """Provides insights into the dataset"""
        # Info
        print("INFO")
        print("-" * 4)
        self.df.info()

        # Shape
        print("\n\nSHAPE")
        print("-" * 5)
        print(f"Records in dataset: {self.df.shape[0]} with {self.df.shape[1]} columns.")

        # Columns
        print("\n\nCOLUMNS")
        print("-" * 6)
        print("Columns in the dataset are:")
        for idx in self.df.columns:
            print(f"- {idx}")

        # Unique Values
        print("\n\nUNIQUE VALUES")
        print("-" * 12)
        for col in self.df.columns:
            print(f"Column {col} has {self.df[col].nunique()} unique values")
            if self.df[col].nunique() < 12:
                print(f"Top unique values in {col} include:")
                for idx in self.df[col].value_counts().index:
                    print(f"- {idx}")
            print("")

        # Missing or Null Values
        print("\nMISSING VALUES")
        print("-" * 15)
        for col in self.df.columns:
            print(f"Column {col} has {self.df[col].isnull().sum()} missing values.")

        # Duplicate Values
        print("\n\nDUPLICATE VALUES")
        print("-" * 16)
        print(f"The dataset has {self.df.duplicated().sum()} duplicated records.")

# Initialize data understanding
data = DataUnderstanding()

# Load the combined dataset
data_path1 = 'combined_dataset.csv'
df = data.load_data(data_path1)

# Display the loaded dataset
df

Unnamed: 0,id,tweetText,tweetURL,type,tweetAuthor,handle,geo,mentions,hashtags,replyCount,quoteCount,retweetCount,likeCount,views,bookmarkCount,createdAt,allMediaURL,videoURL
0,1848108652655231126,#badflu,https://x.com/russmc876/status/184810865265523...,tweet,Logo Designer,@russmc876,"Kingston, Jamaica",,#badflu,0,1,0,1,190,0,2024-10-20 14:06:53,,
1,1821428705962360987,@InfoWars_tv @RealAlexJones @nypost @AJCONN @N...,https://x.com/TippytopshapeU/status/1821428705...,tweet,TippyTop🇺🇲,@TippytopshapeU,"Westchester County, NY","@InfoWars_tv,@RealAlexJones,@nypost,@AJCONN,@N...",#badflu,0,0,2,0,21,0,2024-08-07 23:10:18,https://pbs.twimg.com/media/GUcDvUfWUAACz_n.jp...,
2,1703455805817671797,#badflu causing #Sickness 😷😷,https://x.com/Arslan94140887/status/1703455805...,tweet,Arslan,@Arslan94140887,Oman,,"#badflu,#Sickness",0,0,0,0,7,0,2023-09-17 10:08:07,,
3,1689913178216333312,@SkyNews Only the double jabbed plus should be...,https://x.com/169Nomis/status/1689913178216333312,tweet,Nomis 169,@169Nomis,Not in or near London anymore,@SkyNews,#badflu,0,0,1,10,108,0,2023-08-11 01:14:33,,
4,1688608196582133762,@LauraLoomer This debanking is not on. We are ...,https://x.com/adrianhuston/status/168860819658...,tweet,Adrian Huston JP,@adrianhuston,"Belfast, UK",@LauraLoomer,#badflu,0,0,0,1,130,0,2023-08-07 10:49:01,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,649979394442661888,Time for a flu shot . #DCOHealth. #CDCFLU htt...,https://x.com/HealthDouglasCo/status/649979394...,tweet,Douglas County Health Department,@HealthDouglasCo,"Douglas County, Nebraska",,"#DCOHealth,#CDCFLU",0,0,0,0,-,0,2015-10-02 09:08:58,https://pbs.twimg.com/media/CQUw0L5UcAAV322.jpg,
196,648636068909195264,Time to think about your flu shot. The season ...,https://x.com/HealthDouglasCo/status/648636068...,tweet,Douglas County Health Department,@HealthDouglasCo,"Douglas County, Nebraska",,"#CDCflu,#DCOHealth",0,0,0,0,-,0,2015-09-28 16:11:04,,
197,648577942709112832,It’s time to start thinking about getting yo...,https://x.com/HealthDouglasCo/status/648577942...,tweet,Douglas County Health Department,@HealthDouglasCo,"Douglas County, Nebraska",,"#DCOHealth,#CDCFlu",0,0,0,0,-,0,2015-09-28 12:20:06,https://pbs.twimg.com/media/CQA2M6NU8AE7M84.jpg,
198,648513373307568128,Flu season is almost here. Make plans to prote...,https://x.com/HealthDouglasCo/status/648513373...,tweet,Douglas County Health Department,@HealthDouglasCo,"Douglas County, Nebraska",,"#DCOHealth,#CDCflu",0,0,0,0,-,0,2015-09-28 08:03:31,,


In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             200 non-null    int64 
 1   tweetText      200 non-null    object
 2   tweetURL       200 non-null    object
 3   type           200 non-null    object
 4   tweetAuthor    200 non-null    object
 5   handle         200 non-null    object
 6   geo            180 non-null    object
 7   mentions       77 non-null     object
 8   hashtags       200 non-null    object
 9   replyCount     200 non-null    int64 
 10  quoteCount     200 non-null    int64 
 11  retweetCount   200 non-null    int64 
 12  likeCount      200 non-null    int64 
 13  views          200 non-null    object
 14  bookmarkCount  200 non-null    int64 
 15  createdAt      200 non-null    object
 16  allMediaURL    80 non-null     object
 17  videoURL       5 non-null      object
dtypes: int64(6), object(12)
memory

In [37]:
# Get an understanding of the dataset
data.understanding()

INFO
----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             200 non-null    int64 
 1   tweetText      200 non-null    object
 2   tweetURL       200 non-null    object
 3   type           200 non-null    object
 4   tweetAuthor    200 non-null    object
 5   handle         200 non-null    object
 6   geo            180 non-null    object
 7   mentions       77 non-null     object
 8   hashtags       200 non-null    object
 9   replyCount     200 non-null    int64 
 10  quoteCount     200 non-null    int64 
 11  retweetCount   200 non-null    int64 
 12  likeCount      200 non-null    int64 
 13  views          200 non-null    object
 14  bookmarkCount  200 non-null    int64 
 15  createdAt      200 non-null    object
 16  allMediaURL    80 non-null     object
 17  videoURL       5 non-null      object
dtypes: int64(6), object(12)

- Data Structure:
  - Record count: 200
  - Column count: 18
  - Memory usage: 28.2 KB
- Column Details:
  - id: unique identifier with 200 unique values and no missing values.
- Tweet Content:
  - tweetText: the main text of each tweet, with 198 unique values.
  - tweetURL: the URL of each tweet, with 200 unique values.
- Author Information:
  - tweetAuthor and handle have 98 unique values each, representing the author names and Twitter handles, respectively.
- Geographical Information:
  - geo has 72 unique values and 20 missing values.
- Engagement Metrics:
  - replyCount, quoteCount, retweetCount, likeCount, views, and bookmarkCount are included to track engagement.
  - bookmarkCount has only one value (0), while the other engagement metrics vary with limited unique values.
- Media Content:
  - allMediaURL has 80 non-null values, indicating that 40% of the tweets include media.
  - videoURL has only 5 non-null entries, suggesting few tweets contain videos.
- Other Information:
  - hashtags: contains 109 unique values with no missing data, indicating hashtag diversity.
  - mentions: appears in 77 records and has 67 unique values.
  - createdAt: timestamp of each tweet, with 200 unique values.
- Missing Values:
  - mentions and allMediaURL each have a significant share of missing values (123 and 120 entries, respectively).
  - geo has a moderate number of missing values (20) and videoURL a high number (195).
- Duplicates:
  - No duplicate records were found in the dataset.
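
The `info()` output above lists `views` and `createdAt` as `object` columns, and the displayed rows show `-` in `views` where no count is available. A minimal sketch of how these two columns might be converted before analysis follows; the helper name `coerce_types` is hypothetical, and the three sample rows reuse values visible in the table above.

```python
import pandas as pd


def coerce_types(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with `views` as numbers and `createdAt` as datetimes."""
    out = frame.copy()
    # Non-numeric placeholders such as "-" become NaN instead of raising.
    out["views"] = pd.to_numeric(out["views"], errors="coerce")
    out["createdAt"] = pd.to_datetime(out["createdAt"], format="%Y-%m-%d %H:%M:%S")
    return out


# Three sample rows in the same shape as the scraped data.
sample = pd.DataFrame(
    {
        "views": ["190", "21", "-"],
        "createdAt": [
            "2024-10-20 14:06:53",
            "2024-08-07 23:10:18",
            "2015-10-02 09:08:58",
        ],
    }
)
cleaned = coerce_types(sample)
print(cleaned.dtypes)
print(f"Rows with unknown view counts: {cleaned['views'].isna().sum()}")
```

Coercing to NaN keeps the affected tweets in the dataset, so the missing view counts can be handled explicitly later instead of being silently dropped.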