# How to Use the System

__Follow these steps to use the system on your computer:__



__Step 1: Install Python and Required Libraries__


In [1]:
# Install pandas first by running `pip install pandas` in a terminal
# (or `!pip install pandas` in a notebook cell).
import pandas as pd
# pandas is used for handling and processing the dataset.

__Step 2: Prepare the Dataset__
- Ensure that the dataset is in a .tsv (Tab-Separated Values) format. The system expects a dataset similar to correct_twitter_201904.tsv, with a structure containing columns like text, created_at, author_id, like_count, and place_id.

- The dataset should be loaded into a pandas DataFrame using the following code:

In [6]:
df = pd.read_csv('correct_twitter_201904.tsv', sep='\t')
df

Unnamed: 0,id,event,ts1,ts2,from_stream,directly_from_stream,from_search,directly_from_search,from_quote_search,directly_from_quote_search,...,retweeted,retweeted_author_id,retweeted_handle,retweeted_follower_count,mentioned_author_ids,mentioned_handles,hashtags,urls,media_keys,place_id
0,1131594960443199488,britney_201904,2022-02-28 09:34:44.627023-05:00,2022-02-28 09:34:44.627023-05:00,True,True,False,False,False,False,...,1130917791752757254,3042894016,Iesbwian,22760,,,,,,
1,1131594976750653440,britney_201904,2022-02-28 09:34:44.626921-05:00,2022-02-28 09:34:44.626921-05:00,True,True,False,False,False,False,...,,,,,,,,,,
2,1131589737955942405,britney_201904,2022-02-28 09:34:44.634058-05:00,2022-02-28 09:34:44.634058-05:00,True,True,False,False,False,False,...,,,,,,,,,,
3,1131594909469892610,britney_201904,2022-02-28 09:34:44.627125-05:00,2022-02-28 09:34:44.627125-05:00,True,True,False,False,False,False,...,1130917791752757254,3042894016,Iesbwian,22760,,,,,,
4,1131594812694511617,britney_201904,2022-02-28 09:34:44.627227-05:00,2022-02-28 09:34:44.627227-05:00,True,True,False,False,False,False,...,1130917791752757254,3042894016,Iesbwian,22760,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88032,1122977274352082944,britney_201904,2022-02-28 09:47:43.066500-05:00,2022-02-28 09:47:43.066500-05:00,True,True,False,False,False,False,...,1122975837500989440,2934938957,STORMISKELLYBAG,13773,,,,,,
88033,1122977257969127429,britney_201904,2022-02-28 09:47:43.066605-05:00,2022-02-28 09:47:43.066605-05:00,True,True,False,False,False,False,...,1118942889785085958,613702825,keanuorange,1422,,,,,,
88034,1122977009347518466,britney_201904,2022-02-28 09:47:43.066708-05:00,2022-02-28 09:47:43.066708-05:00,True,True,False,False,False,False,...,,,,,,,,,,
88035,1122976878812442626,britney_201904,2022-02-28 09:47:43.066810-05:00,2022-02-28 09:47:43.066810-05:00,True,True,False,False,False,False,...,1122898327895519233,846864187015888896,TheBlastNews,11533,,,,,,
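
Before going further, it can help to confirm that the file has the columns the later steps rely on. The following is a minimal sketch, not part of the original system: the helper name `load_tweets` and the `REQUIRED_COLUMNS` list are assumptions made for illustration, and the sample rows are made up.

```python
import io

import pandas as pd

# Columns read by the query function in this guide (an assumed list; adjust as needed).
REQUIRED_COLUMNS = ["id", "text", "created_at", "author_id", "like_count", "place_id"]


def load_tweets(source):
    """Read a tab-separated file (path or file-like object) and check its columns."""
    frame = pd.read_csv(source, sep="\t")
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Dataset is missing required columns: {missing}")
    return frame


# Tiny in-memory stand-in for a real .tsv file (made-up rows, not from the dataset).
sample_tsv = (
    "id\ttext\tcreated_at\tauthor_id\tlike_count\tplace_id\n"
    "1\thello music\t2019-04-29 10:00:00+00:00\t11\t3\tp1\n"
    "2\tgood morning\t2019-04-30 11:30:00+00:00\t12\t0\tp2\n"
)
sample_df = load_tweets(io.StringIO(sample_tsv))
print(sample_df.shape)
```

With a real file, the call would be `load_tweets('correct_twitter_201904.tsv')`. Failing early with a clear message is usually easier to debug than a `KeyError` raised later inside a query.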


__Step 3: Clean and Inspect the Dataset__

- Data cleaning (or data cleansing) is the process of detecting and then correcting or removing corrupt or inaccurate records from a dataset, table, or database.

- It also covers identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting that dirty data.

In [11]:
df.columns

Index(['id', 'event', 'ts1', ' ts2', 'from_stream', 'directly_from_stream',
       'from_search', 'directly_from_search', 'from_quote_search',
       'directly_from_quote_search', 'from_convo_search',
       'directly_from_convo_search', 'from_timeline_search',
       'directly_from_timeline_search', 'text', 'lang', 'author_id',
       'author_handle', 'created_at', 'conversation_id', 'possibly_sensitive',
       'reply_settings', 'source', 'author_follower_count', 'retweet_count',
       'reply_count', 'like_count', 'quote_count', 'replied_to',
       'replied_to_author_id', 'replied_to_handle',
       'replied_to_follower_count', 'quoted', 'quoted_author_id',
       'quoted_handle', 'quoted_follower_count', 'retweeted',
       'retweeted_author_id', 'retweeted_handle', 'retweeted_follower_count',
       'mentioned_author_ids', 'mentioned_handles', 'hashtags', 'urls',
       'media_keys', 'place_id'],
      dtype='object')
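
Note that one column name in the output above, `' ts2'`, starts with a space, so `df['ts2']` would raise a `KeyError`. A small sketch of one way to normalize the names, shown on a made-up frame rather than the real dataset:

```python
import pandas as pd

# Made-up frame whose second column name has a stray leading space, like ' ts2' above.
frame = pd.DataFrame({"ts1": ["a"], " ts2": ["b"]})

# Strip surrounding whitespace from every column name.
frame.columns = frame.columns.str.strip()
print(list(frame.columns))
```

The same one-liner, `df.columns = df.columns.str.strip()`, could be applied to the loaded dataset if that column is needed.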

In [12]:
# Summary of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88037 entries, 0 to 88036
Data columns (total 46 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   id                             88037 non-null  int64 
 1   event                          88037 non-null  object
 2   ts1                            88037 non-null  object
 3    ts2                           88037 non-null  object
 4   from_stream                    88037 non-null  bool  
 5   directly_from_stream           88037 non-null  bool  
 6   from_search                    88037 non-null  bool  
 7   directly_from_search           88037 non-null  bool  
 8   from_quote_search              88037 non-null  bool  
 9   directly_from_quote_search     88037 non-null  bool  
 10  from_convo_search              88037 non-null  bool  
 11  directly_from_convo_search     88037 non-null  bool  
 12  from_timeline_search           88037 non-null  bool  
 13  d

In [13]:
# Check each column for missing values
df.isnull().sum()

id                               0
event                            0
ts1                              0
 ts2                             0
from_stream                      0
directly_from_stream             0
from_search                      0
directly_from_search             0
from_quote_search                0
directly_from_quote_search       0
from_convo_search                0
directly_from_convo_search       0
from_timeline_search             0
directly_from_timeline_search    0
text                             0
lang                             0
author_id                        0
author_handle                    0
created_at                       0
conversation_id                  0
possibly_sensitive               0
reply_settings                   0
source                           0
author_follower_count            0
retweet_count                    0
reply_count                      0
like_count                       0
quote_count                      0
replied_to          
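
A zero count here does not necessarily mean every cell holds real data. The hashtag counts later in this guide show a literal `None` entry, which suggests that some columns may store the text `'None'` instead of a true missing value. The sketch below assumes that is the case; the helper name `count_placeholders` is hypothetical and the sample frame is made up.

```python
import pandas as pd


def count_placeholders(frame, placeholder="None"):
    """Count, per column, the cells equal to a placeholder string."""
    return frame.apply(lambda col: col.eq(placeholder).sum())


# Made-up sample that mimics the assumed 'None' placeholder.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "hashtags": ["None", "['FreeBritney']", "None"],
    "place_id": ["None", "None", "None"],
})
placeholder_counts = count_placeholders(sample)
print(placeholder_counts)

# Optionally turn the placeholders into real missing values so isnull() sees them.
cleaned = sample.mask(sample.eq("None"))
print(cleaned.isnull().sum())
```

Whether to convert the placeholders depends on the analysis: leaving them as text keeps `isnull()` at zero, while converting them lets pandas skip them automatically in counts and aggregations.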

In [14]:
# Statistical Summary
df.describe(include='all')

Unnamed: 0,id,event,ts1,ts2,from_stream,directly_from_stream,from_search,directly_from_search,from_quote_search,directly_from_quote_search,...,retweeted,retweeted_author_id,retweeted_handle,retweeted_follower_count,mentioned_author_ids,mentioned_handles,hashtags,urls,media_keys,place_id
count,88037.0,88037,88037,88037,88037,88037,88037,88037,88037,88037,...,88037.0,88037.0,88037.0,88037.0,88037.0,88037.0,88037.0,88037.0,88037.0,88037.0
unique,,1,88037,88037,1,2,1,1,1,1,...,4163.0,2751.0,2751.0,2712.0,1.0,1.0,2317.0,21731.0,6555.0,599.0
top,,britney_201904,2022-02-28 09:34:44.627023-05:00,2022-02-28 09:34:44.627023-05:00,True,True,False,False,False,False,...,,,,,,,,,,
freq,,88037,1,1,88037,84223,88037,88037,88037,88037,...,36508.0,36508.0,36508.0,36508.0,88037.0,88037.0,75875.0,46396.0,71525.0,87016.0
mean,1.128375e+18,,,,,,,,,,...,,,,,,,,,,
std,3001479000000000.0,,,,,,,,,,...,,,,,,,,,,
min,1.101535e+18,,,,,,,,,,...,,,,,,,,,,
25%,1.126285e+18,,,,,,,,,,...,,,,,,,,,,
50%,1.128714e+18,,,,,,,,,,...,,,,,,,,,,
75%,1.130296e+18,,,,,,,,,,...,,,,,,,,,,


__Step 4: Convert created_at to DateTime Format__
- The created_at column in the dataset contains timestamps that include timezone information. 
- We need to convert it to a datetime format that pandas can understand:

In [None]:
# Convert 'created_at' to datetime
df['created_at'] = pd.to_datetime(df['created_at'], utc=True) 
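
To see what `utc=True` does, here is a small sketch on made-up timestamps (the exact format of `created_at` in the dataset is not shown above, so the `-05:00` offset is only an assumption for illustration). It also shows how the same instants could be viewed at a fixed offset before grouping.

```python
from datetime import timedelta, timezone

import pandas as pd

# Made-up timestamps with a -05:00 offset (not taken from the dataset).
raw = pd.Series(["2019-04-29 23:30:00-05:00", "2019-04-30 08:15:00-05:00"])

# Parse and normalize everything to UTC.
as_utc = pd.to_datetime(raw, utc=True)
print(as_utc.dt.date.tolist())  # calendar dates in UTC
print(as_utc.dt.hour.tolist())  # hours in UTC

# Hypothetical alternative: view the same instants at a fixed -05:00 offset.
minus_five = timezone(timedelta(hours=-5))
as_local = as_utc.dt.tz_convert(minus_five)
print(as_local.dt.date.tolist())
print(as_local.dt.hour.tolist())
```

The first timestamp falls on 2019-04-30 in UTC but on 2019-04-29 at `-05:00`, so per-day and per-hour counts depend on which timezone the grouping uses.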

__Step 5: Run the Query Function__

In [41]:
def query_data(term):
    # Filter for tweets containing the term
    term_tweets = df[df['text'].str.contains(term, case=False, na=False)]
    
    if term_tweets.empty:
        return f"No tweets found with the term '{term}'"

    # 1. How many tweets were posted containing the term on each day?
    tweets_per_day = term_tweets.groupby(term_tweets['created_at'].dt.date)['id'].count()

    # 2. How many unique users posted a tweet containing the term?
    unique_users = term_tweets['author_id'].nunique()

    # 3. How many likes did tweets containing the term get, on average?
    average_likes = term_tweets['like_count'].mean()

    # 4. Where (in terms of place IDs) did the tweets come from?
    unique_places = term_tweets['place_id'].unique()

    # 5. What times of day were the tweets posted at?
    term_tweets['hour'] = term_tweets['created_at'].dt.hour
    tweets_per_hour = term_tweets.groupby('hour')['id'].count()

    # 6. Which user posted the most tweets containing the term?
    most_active_user = term_tweets['author_id'].value_counts().idxmax()

    # Formatting the results for easy reading
    result = f"""
    Query Results for the term: '{term}'
    
    1. Number of tweets containing the term on each day:
    {tweets_per_day}
    
    2. Number of unique users who posted a tweet containing the term: {unique_users}
    
    3. Average likes for tweets containing the term: {average_likes}
    
    4. Place IDs from where the tweets were posted: {unique_places}
    
    5. Number of tweets posted at each hour of the day:
    {tweets_per_hour}
    
    6. User who posted the most tweets containing the term: {most_active_user}
    """
    
    return result


term = 'music'
print(query_data(term))


    Query Results for the term: 'music'
    
    1. Number of tweets containing the term on each day:
    created_at
2019-03-12      3
2019-04-06      1
2019-04-14      1
2019-04-16      1
2019-04-21      1
2019-04-24      1
2019-04-26      1
2019-04-27      3
2019-04-28     22
2019-04-29    118
2019-04-30    135
2019-05-01     71
2019-05-02     72
2019-05-03    103
2019-05-04     75
2019-05-05     65
2019-05-06     71
2019-05-07     64
2019-05-08     60
2019-05-09     70
2019-05-10    307
2019-05-11     78
2019-05-12     69
2019-05-13     50
2019-05-14     67
2019-05-15     99
2019-05-16     91
2019-05-17    162
2019-05-18     61
2019-05-19     46
2019-05-20     64
2019-05-21    119
2019-05-22     80
2019-05-23     73
2019-05-24     49
2019-05-25     92
2019-05-26    112
2019-05-27     76
2019-05-28     96
2019-05-29    187
2019-05-30     71
2019-05-31     14
Name: id, dtype: int64
    
    2. Number of unique users who posted a tweet containing the term: 2109
    
    3. Average lik

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  term_tweets['hour'] = term_tweets['created_at'].dt.hour
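
The message above is pandas' `SettingWithCopyWarning`. It appears because `term_tweets` is a filtered selection of `df`, and the function then adds an `hour` column to it. The results are still computed, but the warning can be avoided by taking an explicit copy of the filtered rows. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up tweets, not from the dataset.
tweets = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["I love music", "good morning", "Music night"],
    "created_at": pd.to_datetime(
        ["2019-04-29 10:00:00", "2019-04-29 15:00:00", "2019-04-30 10:30:00"],
        utc=True,
    ),
})

# .copy() makes the filtered rows an independent DataFrame,
# so adding a column to it is unambiguous.
matches = tweets[tweets["text"].str.contains("music", case=False, na=False)].copy()
matches["hour"] = matches["created_at"].dt.hour
tweets_per_hour = matches.groupby("hour")["id"].count()
print(tweets_per_hour)
```

In `query_data`, the equivalent change would be appending `.copy()` to the line that builds `term_tweets`. The original `df` is left untouched either way.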


__Analyzing How Many Tweets Contain Specific Hashtags__

In [45]:
hashtag_count = df['hashtags'].value_counts()
hashtag_count.head(10)


None                 75875
['FreeBritney']       3923
['BritneySpears']      717
['jotafm']             281
['NowPlaying']         280
['freebritney']        222
['JanetJackson']       205
['FREEBRITNEY']        143
['RT']                 139
['WWHL']                95
Name: hashtags, dtype: int64
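
The counts above treat each distinct cell value as one category, so `['FreeBritney']`, `['freebritney']`, and `['FREEBRITNEY']` are counted separately, and a tweet with several hashtags forms its own category. The sketch below assumes the column stores stringified Python lists plus the text `'None'` for tweets without hashtags, as the output suggests. The helper name `count_hashtags` is hypothetical and the sample values are made up.

```python
import ast

import pandas as pd


def count_hashtags(hashtags, placeholder="None"):
    """Count individual hashtags, ignoring case and placeholder entries."""
    present = hashtags[hashtags != placeholder].dropna()
    # Turn "['a', 'b']" into a real list, then give each hashtag its own row.
    tags = present.map(ast.literal_eval).explode()
    return tags.str.lower().value_counts()


sample = pd.Series([
    "None",
    "['FreeBritney']",
    "['freebritney', 'BritneySpears']",
    "['FREEBRITNEY']",
])
hashtag_totals = count_hashtags(sample)
print(hashtag_totals)
```

On the real data this would be called as `count_hashtags(df['hashtags'])`. `ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for text read from a file.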

__Finding Tweets Containing Certain Words (e.g., "music")__

In [44]:
music_tweets = df[df['text'].str.contains('music', case=False, na=False)]
music_tweets.head()

Unnamed: 0,id,event,ts1,ts2,from_stream,directly_from_stream,from_search,directly_from_search,from_quote_search,directly_from_quote_search,...,retweeted,retweeted_author_id,retweeted_handle,retweeted_follower_count,mentioned_author_ids,mentioned_handles,hashtags,urls,media_keys,place_id
89,1131558142569959424,britney_201904,2022-02-28 09:34:49.489632-05:00,2022-02-28 09:34:49.489632-05:00,True,True,False,False,False,False,...,,,,,,,,,,
191,1134153062384357378,britney_201904,2022-01-04 15:45:46.545877-05:00,2022-02-28 09:32:01.642704-05:00,True,True,False,False,False,False,...,1.1341496656063078e+18,4872315238.0,Bryan_KnowsBest,308.0,,,,,,
275,1130933441191710720,britney_201904,2022-01-04 15:46:39.404655-05:00,2022-02-28 09:35:43.998377-05:00,True,True,False,False,False,False,...,1.13065444449502e+18,16339785.0,musicdotjunkee,32389.0,,,['FreeBritney'],,,
339,1129818710095613952,britney_201904,2022-01-04 15:47:01.723682-05:00,2022-02-28 09:37:33.446307-05:00,True,True,False,False,False,False,...,1.1287053038628084e+18,1.030185143460397e+18,EatPrayBritney,9329.0,,,,,,
341,1128705303862808577,britney_201904,2022-01-04 15:47:01.748329-05:00,2022-02-28 09:40:48.913892-05:00,True,True,False,False,False,False,...,,,,,,,['FREEBRITNEY'],,,


# Key Design Choices

__1. Pandas for Data Handling:__
- We use pandas to handle the dataset because it is fast, well-documented, and easy to use for data analysis and manipulation. 

__2. Search Functionality:__
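- The system filters tweets that contain a specific term, allowing for flexible searches. We use str.contains() with case=False and na=False, so matching is case-insensitive and rows with missing text are skipped. Note that str.contains() treats the term as a regular expression by default.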
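
Because `str.contains()` interprets its pattern as a regular expression by default, a search term such as `?` or `(` would be read as regex syntax. A small sketch of two ways to search for the term as plain text, shown on made-up strings:

```python
import re

import pandas as pd

# Made-up texts, including one missing value.
texts = pd.Series(["Is this real?", "No question here", None])

# Option 1: regex=False makes pandas look for the term as plain text.
literal_matches = texts.str.contains("?", case=False, na=False, regex=False)
print(literal_matches.tolist())

# Option 2: keep regex matching but escape the term first.
escaped_matches = texts.str.contains(re.escape("?"), case=False, na=False)
print(escaped_matches.tolist())
```

Either option could be applied inside `query_data` if search terms are expected to contain punctuation. Leaving the default in place keeps the ability to pass deliberate patterns such as `music|song`.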

__3. Datetime Conversion:__
- Converting the created_at column to datetime with utc=True normalizes timezone-aware timestamps to UTC. This keeps time-based queries consistent: the per-day and per-hour counts are all computed in UTC.

__4. Efficient Grouping:__
- We group tweets by day and by hour using groupby(), and count tweets per user with value_counts(). Both are vectorized pandas operations, so they avoid explicit Python loops over the rows.

__5. User-Focused Results:__
- We provide actionable insights (e.g., the most active user, average likes) to help users quickly understand trends.
