# Tweet mining using the Twitter API via Tweepy

In this notebook I use the Tweepy Python library to collect tweets that contain relevant hashtags. I was able to retrieve around 19000 unique tweets via the Twitter API. At the end, the datasets for the different depression-related hashtags are combined, cleaned, and saved as depressive_tweets.csv.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Tweet mining

In [2]:
!pip install -qqq tweepy

In [3]:
## Import required libraries
import csv

import pandas as pd
import tweepy

## Twitter API credentials (API_KEY, API_KEY_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
## Assumes they live in a local config module that is kept out of version control
import config

In [4]:
## Twitter API related information
consumer_key = config.API_KEY
consumer_secret = config.API_KEY_SECRET
access_key= config.ACCESS_TOKEN
access_secret = config.ACCESS_TOKEN_SECRET

In [5]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret) # Pass in Consumer key and secret for authentication by API
auth.set_access_token(access_key, access_secret) # Pass in Access key and secret for authentication by API
api = tweepy.API(auth,wait_on_rate_limit=True,wait_on_rate_limit_notify=True) # Sleeps when API limit is reached
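The call above matches Tweepy 3.x, which is what the rest of this notebook assumes. As far as I know, Tweepy 4.0 removed the `wait_on_rate_limit_notify` argument and renamed `API.search` to `API.search_tweets`, so an unpinned `pip install tweepy` may pull a version where these cells fail. A minimal sketch of one way to stay tolerant of both names; `resolve_search` is a hypothetical helper, not part of Tweepy:

```python
def resolve_search(api):
    """Return the standard search method under whichever name this Tweepy version uses."""
    search = getattr(api, "search_tweets", None)  # name used by Tweepy 4.x
    if search is None:
        search = api.search  # name used by Tweepy 3.x
    return search
```

The result would then be passed to `tweepy.Cursor` in place of `api.search`. On 4.x the `wait_on_rate_limit_notify=True` argument would also need to be dropped from the `tweepy.API(...)` call.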

In [None]:
## depress_tags = ["#depressed", "#anxiety", "#depression", "#suicide", "#mentalhealth",
##                "#loneliness", "#hopelessness", "#itsokaynottobeokay", "#sad"]

## "#depressed"

In [6]:
## Create a function for tweets mining
def tweets_mining1(search_query1, num_tweets1, since_id_num1):
  # Collect tweets using the Cursor object
  # Each item in the iterator has various attributes that you can access to get information about each tweet
  tweet_list1 = [tweets for tweets in tweepy.Cursor(api.search, q=search_query1, lang="en", since_id=since_id_num1, 
                                                    tweet_mode='extended').items(num_tweets1)]
  
  # Begin scraping the tweets individually:
  for tweet in tweet_list1[::-1]:
    tweet_id = tweet.id # get Tweet ID result
    created_at = tweet.created_at # get time tweet was created
    text = tweet.full_text # retrieve full tweet text
    location = tweet.user.location # retrieve user location
    retweet = tweet.retweet_count # retrieve number of retweets
    favorite = tweet.favorite_count # retrieve number of likes
    with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_depressed_1.csv','a', newline='', encoding='utf-8') as csvFile1:
      csv_writer1 = csv.writer(csvFile1, delimiter=',') # create an instance of csv object
      csv_writer1.writerow([tweet_id, created_at, text, location, retweet, favorite]) # write each row
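The cells that follow repeat this function once per hashtag, changing only the output file. A minimal sketch of a single parametrized version, which also opens the CSV once instead of once per tweet. The names `build_query`, `tweet_to_row`, and `append_tweets` are hypothetical, and `tweets` stands for whatever iterable of status objects `tweepy.Cursor(...).items(n)` yields:

```python
import csv

# Exclusions applied to every hashtag search in this notebook
FILTERS = " -filter:links AND -filter:retweets AND -filter:replies"


def build_query(hashtag):
    """Append the link, retweet, and reply exclusions to a hashtag."""
    return hashtag + FILTERS


def tweet_to_row(tweet):
    """Pick the six fields that each CSV row stores."""
    return [tweet.id, tweet.created_at, tweet.full_text,
            tweet.user.location, tweet.retweet_count, tweet.favorite_count]


def append_tweets(tweets, csv_path):
    """Append tweets to csv_path in reversed order; return the number of rows written."""
    # Reverse as the notebook does, so the last row of the file holds
    # the tweet that the search returned first
    rows = [tweet_to_row(t) for t in tweets][::-1]
    with open(csv_path, "a", newline="", encoding="utf-8") as csv_file:
        csv.writer(csv_file, delimiter=",").writerows(rows)
    return len(rows)
```

With Tweepy 3.x this could be called as `append_tweets(tweepy.Cursor(api.search, q=build_query("#sad"), lang="en", since_id=latest_tweet, tweet_mode="extended").items(2000), path)`, where `path` is the CSV file for that hashtag.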

In [7]:
search_words1 = "#depressed" # Specifying exact phrase to search
# Exclude Links, retweets, replies
search_query1 = search_words1 + " -filter:links AND -filter:retweets AND -filter:replies" 
with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_depressed_1.csv', encoding='utf-8') as data:
    latest_tweet = int(list(csv.reader(data))[-1][0]) 
tweets_mining1(search_query1, 1000, latest_tweet)
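The lookup above raises an error if the CSV does not exist yet or is empty, which is the situation on the first run for a new hashtag. A minimal sketch of a more forgiving reader; `latest_tweet_id` is a hypothetical helper, and returning `None` assumes the caller then leaves out `since_id` for that search:

```python
import csv
import os


def latest_tweet_id(csv_path):
    """Return the tweet ID in the last non-empty row, or None if there is no such row."""
    if not os.path.exists(csv_path):
        return None
    last_id = None
    with open(csv_path, newline="", encoding="utf-8") as data:
        for row in csv.reader(data):  # stream rows instead of building a full list
            if row and row[0].strip():
                last_id = int(row[0])
    return last_id
```

Opening the file with `newline=""` lets the `csv` module handle tweets whose text contains line breaks.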

In [8]:
df_depressed_1 = pd.read_csv("/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_depressed_1.csv",
                 names=['tweet.id', "created_at","text", "location", "retweet", "favorite"])

In [9]:
df_depressed_1

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1446882366945837057,2021-10-09 16:56:52,I totally need someone to hug me TIGHT and say...,,0,1
1,1446896799860539394,2021-10-09 17:54:13,i plan on committing suicide today or tommorro...,,0,1
2,1446912210672959491,2021-10-09 18:55:28,Exhausted! Absolutely exhausted and my day isn...,Lost 🤕,0,8
3,1446931930537209856,2021-10-09 20:13:49,Im going to get Far Cry 6 and playing video ga...,,0,1
4,1446934914453082113,2021-10-09 20:25:41,Just #depressed haven’t made money in 4 days o...,Daddy’s lap.,0,2
...,...,...,...,...,...,...
1440,1459292661848883203,2021-11-12 22:50:57,it gets dark at 5 now. #depressed,"Toronto, Ontario",0,2
1441,1459295472993153030,2021-11-12 23:02:07,"Ignore my tweets, if I tweet, for the next cou...","Paisley, Scotland",0,1
1442,1459323510803759108,2021-11-13 00:53:32,how tf you a psychology major and depressed? l...,"San Diego, CA",0,0
1443,1459376207527440385,2021-11-13 04:22:56,Liquors my bestie till my flight tomorrow fml ...,"Dreamville, LBC♥",0,0


In [10]:
## Finding unique values in each column
for col in df_depressed_1:
    print("There are ", len(df_depressed_1[col].unique()), "unique values in ", col)

There are  849 unique values in  tweet.id
There are  849 unique values in  created_at
There are  843 unique values in  text
There are  383 unique values in  location
There are  7 unique values in  retweet
There are  25 unique values in  favorite


## "#anxiety"

In [11]:
## Create a function for tweets mining
def tweets_mining2(search_query2, num_tweets2, since_id_num2):
  # Collect tweets using the Cursor object
  # Each item in the iterator has various attributes that you can access to get information about each tweet
  tweet_list2 = [tweets for tweets in tweepy.Cursor(api.search, q=search_query2, lang="en", since_id=since_id_num2, 
                                                    tweet_mode='extended').items(num_tweets2)]
  
  # Begin scraping the tweets individually:
  for tweet in tweet_list2[::-1]:
    tweet_id = tweet.id # get Tweet ID result
    created_at = tweet.created_at # get time tweet was created
    text = tweet.full_text # retrieve full tweet text
    location = tweet.user.location # retrieve user location
    retweet = tweet.retweet_count # retrieve number of retweets
    favorite = tweet.favorite_count # retrieve number of likes
    with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_anxiety_1.csv','a', newline='', encoding='utf-8') as csvFile2:
      csv_writer2 = csv.writer(csvFile2, delimiter=',') # create an instance of csv object
      csv_writer2.writerow([tweet_id, created_at, text, location, retweet, favorite]) # write each row

In [12]:
search_words2 = "#anxiety" # Specifying exact phrase to search
# Exclude Links, retweets, replies
search_query2 = search_words2 + " -filter:links AND -filter:retweets AND -filter:replies"
with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_anxiety_1.csv', encoding='utf-8') as data:
    latest_tweet = int(list(csv.reader(data))[-1][0]) 
tweets_mining2(search_query2, 2000, latest_tweet)

In [13]:
df_anxiety_1 = pd.read_csv("/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_anxiety_1.csv",
                 names=['tweet.id', "created_at","text", "location", "retweet", "favorite"])

In [14]:
df_anxiety_1

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1447067749654614019,2021-10-10 05:13:31,I can't wait to get the hell out. so I'll jus...,,0,0
1,1447069714379857927,2021-10-10 05:21:19,Morning. All people except me sleeping. @Billy...,"Queenie's Castle,Yate, S Glos",0,1
2,1447072203388985346,2021-10-10 05:31:13,"On #WorldMentalHealthDay, a big shoutout to my...",Bengaluru/Muscat/Palakad/Kochi,0,9
3,1447072334825754626,2021-10-10 05:31:44,I hate having anxiety about doing stuff that I...,"Utah, USA",0,0
4,1447074986531848192,2021-10-10 05:42:16,"I am not scared of my ADHD, depression and anx...","Wollongong, New South Wales",2,11
...,...,...,...,...,...,...
6867,1459224031777939460,2021-11-12 18:18:14,It’s amazing how everyone runs to me as the su...,"Pennsylvania, USA",0,0
6868,1459224808512704516,2021-11-12 18:21:20,Any suggestions on settling the stomach after ...,"Everywhere, Anywhere",0,0
6869,1459228047278751747,2021-11-12 18:34:12,Gotta love that superpowered #anxiety taking h...,,0,0
6870,1459229518128893952,2021-11-12 18:40:02,Growth nor healing is linear. Sometimes you ma...,London,0,0


In [15]:
## Finding unique values in each column
for col in df_anxiety_1:
    print("There are ", len(df_anxiety_1[col].unique()), "unique values in ", col)

There are  4738 unique values in  tweet.id
There are  4733 unique values in  created_at
There are  4342 unique values in  text
There are  1381 unique values in  location
There are  33 unique values in  retweet
There are  80 unique values in  favorite


## "#suicide"

In [10]:
## Create a function for tweets mining
def tweets_mining3(search_query3, num_tweets3, since_id_num3):
  # Collect tweets using the Cursor object
  # Each item in the iterator has various attributes that you can access to get information about each tweet
  tweet_list3 = [tweets for tweets in tweepy.Cursor(api.search, q=search_query3, lang="en", since_id=since_id_num3, 
                                                    tweet_mode='extended').items(num_tweets3)]
  
  # Begin scraping the tweets individually:
  for tweet in tweet_list3[::-1]:
    tweet_id = tweet.id # get Tweet ID result
    created_at = tweet.created_at # get time tweet was created
    text = tweet.full_text # retrieve full tweet text
    location = tweet.user.location # retrieve user location
    retweet = tweet.retweet_count # retrieve number of retweets
    favorite = tweet.favorite_count # retrieve number of likes
    with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_suicide_1.csv','a', newline='', encoding='utf-8') as csvFile3:
      csv_writer3 = csv.writer(csvFile3, delimiter=',') # create an instance of csv object
      csv_writer3.writerow([tweet_id, created_at, text, location, retweet, favorite]) # write each row

In [11]:
search_words3 = "#suicide" # Specifying exact phrase to search
# Exclude Links, retweets, replies
search_query3 = search_words3 + " -filter:links AND -filter:retweets AND -filter:replies" 
with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_suicide_1.csv', encoding='utf-8') as data:
    latest_tweet = int(list(csv.reader(data))[-1][0]) 
tweets_mining3(search_query3, 10000, latest_tweet)

In [12]:
df_suicide_1 = pd.read_csv("/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_suicide_1.csv",
                 names=['tweet.id', "created_at","text", "location", "retweet", "favorite"])

In [13]:
df_suicide_1

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1447381474034999296,2021-10-11 02:00:09,#suicide is the strong belief that no matter h...,,0,0
1,1447439429409415172,2021-10-11 05:50:26,"""suicide""\nHollowness enough\nSilence enough\n...",,2,2
2,1447444376464998400,2021-10-11 06:10:06,Every year passes but the pain remains the sam...,India,0,0
3,1447445469467131906,2021-10-11 06:14:26,Have I told you how much I hate my life😂😂😁 #su...,"Ohio, USA",0,1
4,1447461306295013377,2021-10-11 07:17:22,The man responsible for the #CDC policies that...,United States,1,2
...,...,...,...,...,...,...
713,1459446304577363971,2021-11-13 09:01:28,Someone wanted me to tell you. You're beautifu...,D(1) Florida,0,0
714,1459454059975352320,2021-11-13 09:32:17,It's a regular thing🙂💔\n#Coimbatore #suicide #...,"Tiruppur, India",0,3
715,1459454073644765185,2021-11-13 09:32:21,#Suicide is not as bad as people make it \n\nB...,The Chisolm Trail,0,0
716,1459495548373934081,2021-11-13 12:17:09,Just Uploaded My Review Of Dear Evan Hansen To...,,0,0


## "#hopelessness"

In [14]:
## Create a function for tweets mining
def tweets_mining4(search_query4, num_tweets4, since_id_num4):
  # Collect tweets using the Cursor object
  # Each item in the iterator has various attributes that you can access to get information about each tweet
  tweet_list4 = [tweets for tweets in tweepy.Cursor(api.search, q=search_query4, lang="en", since_id=since_id_num4, 
                                                    tweet_mode='extended').items(num_tweets4)]
  
  # Begin scraping the tweets individually:
  for tweet in tweet_list4[::-1]:
    tweet_id = tweet.id # get Tweet ID result
    created_at = tweet.created_at # get time tweet was created
    text = tweet.full_text # retrieve full tweet text
    location = tweet.user.location # retrieve user location
    retweet = tweet.retweet_count # retrieve number of retweets
    favorite = tweet.favorite_count # retrieve number of likes
    with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_hopeless_1.csv','a', newline='', encoding='utf-8') as csvFile4:
      csv_writer4 = csv.writer(csvFile4, delimiter=',') # create an instance of csv object
      csv_writer4.writerow([tweet_id, created_at, text, location, retweet, favorite]) # write each row

In [15]:
search_words4 = "#hopelessness" # Specifying exact phrase to search
# Exclude Links, retweets, replies
search_query4 = search_words4 + " -filter:links AND -filter:retweets AND -filter:replies"
with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_hopeless_1.csv', encoding='utf-8') as data:
    latest_tweet = int(list(csv.reader(data))[-1][0]) 
tweets_mining4(search_query4, 10000, latest_tweet)

In [16]:
df_hopeless_1 = pd.read_csv("/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_hopeless_1.csv",
                 names=['tweet.id', "created_at","text", "location", "retweet", "favorite"])

In [17]:
df_hopeless_1

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1447537898572574730,2021-10-11 12:21:43,Open discussion. Between the Transfer Portal a...,Cheyenne Wyoming,0,0
1,1447540582490988553,2021-10-11 12:32:23,Plenty of things are changing in my life and t...,,0,0
2,1447807717859491842,2021-10-12 06:13:53,I feel a little hopeless. Anyone else? #hopele...,,0,0
3,1448076026219692033,2021-10-13 00:00:03,"Which is more healthy? Hope, or hopelessness? ...","Denver, CO",0,0
4,1448382047375040513,2021-10-13 20:16:04,So someone tell me how do I get over #HOPELESS...,Portland Or .,0,2
5,1448595145138622464,2021-10-14 10:22:50,No parent deserves to experience the Indian le...,"Bombay, Dubai",1,4
6,1448843909841313793,2021-10-15 02:51:20,Being in a #union also looks a lot like being ...,"Alberta, Canada",7,17
7,1449848070783524864,2021-10-17 21:21:31,I am so glad that @GreysABC is tackling the hu...,,0,1
8,1447537898572574730,2021-10-11 12:21:43,Open discussion. Between the Transfer Portal a...,Cheyenne Wyoming,0,0
9,1447540582490988553,2021-10-11 12:32:23,Plenty of things are changing in my life and t...,,0,0


## "#mentalhealth"

In [18]:
## Create a function for tweets mining
def tweets_mining5(search_query5, num_tweets5, since_id_num5):
  # Collect tweets using the Cursor object
  # Each item in the iterator has various attributes that you can access to get information about each tweet
  tweet_list5 = [tweets for tweets in tweepy.Cursor(api.search, q=search_query5, lang="en", since_id=since_id_num5, 
                                                    tweet_mode='extended').items(num_tweets5)]
  
  # Begin scraping the tweets individually:
  for tweet in tweet_list5[::-1]:
    tweet_id = tweet.id # get Tweet ID result
    created_at = tweet.created_at # get time tweet was created
    text = tweet.full_text # retrieve full tweet text
    location = tweet.user.location # retrieve user location
    retweet = tweet.retweet_count # retrieve number of retweets
    favorite = tweet.favorite_count # retrieve number of likes
    with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_mentalhealth_1.csv','a', newline='', encoding='utf-8') as csvFile5:
      csv_writer5 = csv.writer(csvFile5, delimiter=',') # create an instance of csv object
      csv_writer5.writerow([tweet_id, created_at, text, location, retweet, favorite]) # write each row

In [19]:
search_words5 = "#mentalhealth" # Specifying exact phrase to search
# Exclude Links, retweets, replies
search_query5 = search_words5 + " -filter:links AND -filter:retweets AND -filter:replies" 
with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_mentalhealth_1.csv', encoding='utf-8') as data:
    latest_tweet = int(list(csv.reader(data))[-1][0])
tweets_mining5(search_query5, 1000, latest_tweet)

In [20]:
df_mentalhealth_1 = pd.read_csv("/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_mentalhealth_1.csv",
                 names=['tweet.id', "created_at","text", "location", "retweet", "favorite"])

In [21]:
df_mentalhealth_1

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1449685870945185792,2021-10-17 10:37:00,Sunday's goals. \n1. Take meds\n2. Drink 3 lit...,,0,1
1,1449686119658840065,2021-10-17 10:37:59,"""????"" #Mentalhealth\n\ni'm tired of fighting...",,0,0
2,1449686255185321986,2021-10-17 10:38:31,Surrounded by people but feeling so alone 😔 \n...,,0,1
3,1449686716168671232,2021-10-17 10:40:21,I understand my dv worker has emergencies but ...,,0,0
4,1449687397776592898,2021-10-17 10:43:04,Struggling to get out of bed and do things tha...,"England, United Kingdom",0,0
...,...,...,...,...,...,...
6592,1459531596009283600,2021-11-13 14:40:23,Let’s make good choices today friends!!! ❤️ #R...,"Florida, USA",0,1
6593,1459532754387976200,2021-11-13 14:45:00,Oh it’s a dark joke when I say I wanna bedazzl...,,0,1
6594,1459532763942604800,2021-11-13 14:45:02,I discovered today that clothes shopping is a ...,"England, United Kingdom",0,1
6595,1459532906074935304,2021-11-13 14:45:36,We composed a tweet thread about our college's...,,0,1


## "#loneliness"

In [22]:
## Create a function for tweets mining
def tweets_mining6(search_query6, num_tweets6, since_id_num6):
  # Collect tweets using the Cursor object
  # Each item in the iterator has various attributes that you can access to get information about each tweet
  tweet_list6 = [tweets for tweets in tweepy.Cursor(api.search, q=search_query6, lang="en", since_id=since_id_num6, 
                                                    tweet_mode='extended').items(num_tweets6)]
  
  # Begin scraping the tweets individually:
  for tweet in tweet_list6[::-1]:
    tweet_id = tweet.id # get Tweet ID result
    created_at = tweet.created_at # get time tweet was created
    text = tweet.full_text # retrieve full tweet text
    location = tweet.user.location # retrieve user location
    retweet = tweet.retweet_count # retrieve number of retweets
    favorite = tweet.favorite_count # retrieve number of likes
    with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_loneliness_1.csv','a', newline='', encoding='utf-8') as csvFile6:
      csv_writer6 = csv.writer(csvFile6, delimiter=',') # create an instance of csv object
      csv_writer6.writerow([tweet_id, created_at, text, location, retweet, favorite]) # write each row

In [23]:
search_words6 = "#loneliness" # Specifying exact phrase to search
# Exclude Links, retweets, replies
search_query6 = search_words6 + " -filter:links AND -filter:retweets AND -filter:replies" 
with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_loneliness_1.csv', encoding='utf-8') as data:
    latest_tweet = int(list(csv.reader(data))[-1][0])
tweets_mining6(search_query6, 10000, latest_tweet)

In [24]:
df_loneliness_1 = pd.read_csv("/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_loneliness_1.csv",
                 names=['tweet.id', "created_at","text", "location", "retweet", "favorite"])

In [25]:
df_loneliness_1

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1447444376464998400,2021-10-11 06:10:06,Every year passes but the pain remains the sam...,India,0,0
1,1447517473679441921,2021-10-11 11:00:33,"In this life, I can't expect things to be in m...",Davao Region,0,0
2,1447540227422162949,2021-10-11 12:30:58,holidays can bring on a sense of loss - of fam...,,0,0
3,1447564113928863744,2021-10-11 14:05:53,Must be good to have someone by your side. #Lo...,,0,0
4,1447599325304000515,2021-10-11 16:25:48,"#Artists without an air of #loneliness , are #...","Sulaimanyah, Kurdistan",0,5
...,...,...,...,...,...,...
306,1459371193283362820,2021-11-13 04:03:00,I want someone who loves to take nighttime dri...,"North Carolina, USA",0,0
307,1459473286836989959,2021-11-13 10:48:41,I have apparently reached the point of #autist...,"South West, England",0,1
308,1459491234473553921,2021-11-13 12:00:00,Give us a call. Need any advice with #covid19 ...,"Dublin City, Ireland",1,1
309,1459495762908401664,2021-11-13 12:18:00,fob lyrics trying so hard to be someone you’re...,she/they • 18 • scorpio,0,1


## "#itsokaynottobeokay"

In [26]:
## Create a function for tweets mining
def tweets_mining7(search_query7, num_tweets7, since_id_num7):
  # Collect tweets using the Cursor object
  # Each item in the iterator has various attributes that you can access to get information about each tweet
  tweet_list7 = [tweets for tweets in tweepy.Cursor(api.search, q=search_query7, lang="en", since_id=since_id_num7, 
                                                    tweet_mode='extended').items(num_tweets7)]
  
  # Begin scraping the tweets individually:
  for tweet in tweet_list7[::-1]:
    tweet_id = tweet.id # get Tweet ID result
    created_at = tweet.created_at # get time tweet was created
    text = tweet.full_text # retrieve full tweet text
    location = tweet.user.location # retrieve user location
    retweet = tweet.retweet_count # retrieve number of retweets
    favorite = tweet.favorite_count # retrieve number of likes
    with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_itsoknottobeok_1 copy.csv','a', newline='', encoding='utf-8') as csvFile7:
      csv_writer7 = csv.writer(csvFile7, delimiter=',') # create an instance of csv object
      csv_writer7.writerow([tweet_id, created_at, text, location, retweet, favorite]) # write each row

In [27]:
search_words7 = "#itsokaynottobeokay" # Specifying exact phrase to search
# Exclude Links, retweets, replies
search_query7 = search_words7 + " -filter:links AND -filter:retweets AND -filter:replies"
with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_itsoknottobeok_1 copy.csv', encoding='utf-8') as data:
    latest_tweet = int(list(csv.reader(data))[-1][0]) 
tweets_mining7(search_query7, 2000, latest_tweet)

In [28]:
df_itsok_1 = pd.read_csv("/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_itsoknottobeok_1 copy.csv",
                 names=['tweet.id', "created_at","text", "location", "retweet", "favorite"])

In [29]:
df_itsok_1

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1447444376464998400,2021-10-11 06:10:06,Every year passes but the pain remains the sam...,India,0,0
1,1447517473679441921,2021-10-11 11:00:33,"In this life, I can't expect things to be in m...",Davao Region,0,0
2,1447540227422162949,2021-10-11 12:30:58,holidays can bring on a sense of loss - of fam...,,0,0
3,1447564113928863744,2021-10-11 14:05:53,Must be good to have someone by your side. #Lo...,,0,0
4,1447599325304000515,2021-10-11 16:25:48,"#Artists without an air of #loneliness , are #...","Sulaimanyah, Kurdistan",0,5
...,...,...,...,...,...,...
160,1459084076250546178,2021-11-12 09:02:06,Every problem has a solution if you don’t know...,"South East, England",0,10
161,1459236894219325441,2021-11-12 19:09:21,"I'm loving @calumscott new song, definitely me...","Wrexham, Wales",0,3
162,1459270946485719041,2021-11-12 21:24:40,You ever stop to acknowledge : would you look...,United States,0,2
163,1459429100180111361,2021-11-13 07:53:07,i became teume bcoz of “ #itsokaynottobeokay ”...,,0,0


## "#depression"

In [30]:
## Create a function for tweets mining
def tweets_mining8(search_query8, num_tweets8, since_id_num8):
  # Collect tweets using the Cursor object
  # Each item in the iterator has various attributes that you can access to get information about each tweet
  tweet_list8 = [tweets for tweets in tweepy.Cursor(api.search, q=search_query8, lang="en", since_id=since_id_num8, 
                                                    tweet_mode='extended').items(num_tweets8)]
  
  # Begin scraping the tweets individually:
  for tweet in tweet_list8[::-1]:
    tweet_id = tweet.id # get Tweet ID result
    created_at = tweet.created_at # get time tweet was created
    text = tweet.full_text # retrieve full tweet text
    location = tweet.user.location # retrieve user location
    retweet = tweet.retweet_count # retrieve number of retweets
    favorite = tweet.favorite_count # retrieve number of likes
    with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_depression_1.csv','a', newline='', encoding='utf-8') as csvFile8:
      csv_writer8 = csv.writer(csvFile8, delimiter=',') # create an instance of csv object
      csv_writer8.writerow([tweet_id, created_at, text, location, retweet, favorite]) # write each row

In [31]:
search_words8 = "#depression" # Specifying exact phrase to search
# Exclude Links, retweets, replies
search_query8 = search_words8 + " -filter:links AND -filter:retweets AND -filter:replies"
with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_depression_1.csv', encoding='utf-8') as data:
    latest_tweet = int(list(csv.reader(data))[-1][0]) 
tweets_mining8(search_query8, 1000, latest_tweet)

In [32]:
df_depression_1 = pd.read_csv("/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_depression_1.csv",
                 names=['tweet.id', "created_at","text", "location", "retweet", "favorite"])

In [33]:
df_depression_1

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1447381882828623879,2021-10-11 02:01:46,#letstalk many suffering from #depression and ...,"Chicago, IL",0,0
1,1447387707362131970,2021-10-11 02:24:55,#Harassmentatwork can lead to debilitating men...,Lahore,1,1
2,1447396592877805570,2021-10-11 03:00:13,So . . . my #therapist called my wife and told...,"If it makes a difference, ask.",0,0
3,1447398472735342600,2021-10-11 03:07:41,#psychology #love #mentalhealth #therapy #heal...,,1,0
4,1447400177510146062,2021-10-11 03:14:28,#psychology #love #mentalhealth #therapy #heal...,,1,4
...,...,...,...,...,...,...
4478,1459517445736124420,2021-11-13 13:44:10,I've literally cried atleast once a day for th...,,0,0
4479,1459521433193877511,2021-11-13 14:00:00,Black cohosh (Cimicifuga racemosa) is a partic...,Global,1,1
4480,1459527712775847936,2021-11-13 14:24:58,"I mention therapy to him today, his response ""...",,0,1
4481,1459531002276192263,2021-11-13 14:38:02,Finna go to dollar tree and get some organizin...,"Dallas Texas, USA",0,0


In [14]:
## Finding unique values in each column
for col in df_depression_1:
    print("There are ", len(df_depression_1[col].unique()), "unique values in ", col)

There are  3185 unique values in  tweet.id
There are  3182 unique values in  created_at
There are  2818 unique values in  text
There are  939 unique values in  location
There are  23 unique values in  retweet
There are  59 unique values in  favorite


## "#sad"

In [34]:
## Create a function for tweets mining
def tweets_mining9(search_query9, num_tweets9, since_id_num9):
  # Collect tweets using the Cursor object
  # Each item in the iterator has various attributes that you can access to get information about each tweet
  tweet_list9 = [tweets for tweets in tweepy.Cursor(api.search, q=search_query9, lang="en", since_id=since_id_num9, 
                                                    tweet_mode='extended').items(num_tweets9)]
  
  # Begin scraping the tweets individually:
  for tweet in tweet_list9[::-1]:
    tweet_id = tweet.id # get Tweet ID result
    created_at = tweet.created_at # get time tweet was created
    text = tweet.full_text # retrieve full tweet text
    location = tweet.user.location # retrieve user location
    retweet = tweet.retweet_count # retrieve number of retweets
    favorite = tweet.favorite_count # retrieve number of likes
    with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_sad_1.csv','a', newline='', encoding='utf-8') as csvFile9:
      csv_writer9 = csv.writer(csvFile9, delimiter=',') # create an instance of csv object
      csv_writer9.writerow([tweet_id, created_at, text, location, retweet, favorite]) # write each row

In [35]:
search_words9 = "#sad" # Specifying exact phrase to search
# Exclude Links, retweets, replies
search_query9 = search_words9 + " -filter:links AND -filter:retweets AND -filter:replies" 
with open('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_sad_1.csv', encoding='utf-8') as data:
    latest_tweet = int(list(csv.reader(data))[-1][0]) 
tweets_mining9(search_query9, 2000, latest_tweet)

In [36]:
df_sad_1 = pd.read_csv("/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/tweets_sad_1.csv",
                 names=['tweet.id', "created_at","text", "location", "retweet", "favorite"])

In [37]:
df_sad_1

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1447386915502792706,2021-10-11 02:21:46,Tried to propose to Todd with an air ring duri...,MD/DC,0,4
1,1447389433553096704,2021-10-11 02:31:46,Forgetting to bring a post game pint to pickup...,Canada,0,1
2,1447390726132625416,2021-10-11 02:36:54,bro wtf i came to school because of him and he...,she / her | cbyf !!,0,0
3,1447390741706149895,2021-10-11 02:36:58,I agree with @clint_dempsey on the Yanks not w...,"Los Angeles, CA",0,0
4,1447391562380554244,2021-10-11 02:40:14,The amount of people who do not tip for grocer...,,0,1
...,...,...,...,...,...,...
3517,1459521498842992642,2021-11-13 14:00:16,Just got banned from a server F #sad,Jakarta Capital Region,0,1
3518,1459521611997003777,2021-11-13 14:00:43,I literally cried during my exam and the cam i...,بيت أمك,0,0
3519,1459524263946326017,2021-11-13 14:11:15,No one can be happy with a guy like me. That's...,"Varanasi, Uttar Pradesh, India",0,0
3520,1459530315437785095,2021-11-13 14:35:18,arrived at my house but Am I Home? #deep #sad ...,they19sea,1,3


# Combining all the tweets

In [38]:
import glob

In [39]:
path = r'/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API'  # use your path
all_files = glob.glob(path + "/*.csv")

tweets = []

for filename in all_files:
    df = pd.read_csv(filename, 
                     names=['tweet.id', "created_at","text", "location", "retweet", "favorite"]) # Convert each csv to a dataframe
    tweets.append(df)

tweets_df = pd.concat(tweets, ignore_index=True) # Merge all dataframes
#tweets_df.columns=['tweet.id', "created_at","text", "location", "retweet", "favorite"]
tweets_df.head()

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1447537898572574730,2021-10-11 12:21:43,Open discussion. Between the Transfer Portal a...,Cheyenne Wyoming,0,0
1,1447540582490988553,2021-10-11 12:32:23,Plenty of things are changing in my life and t...,,0,0
2,1447807717859491842,2021-10-12 06:13:53,I feel a little hopeless. Anyone else? #hopele...,,0,0
3,1448076026219692033,2021-10-13 00:00:03,"Which is more healthy? Hope, or hopelessness? ...","Denver, CO",0,0
4,1448382047375040513,2021-10-13 20:16:04,So someone tell me how do I get over #HOPELESS...,Portland Or .,0,2


In [40]:
tweets_df

Unnamed: 0,tweet.id,created_at,text,location,retweet,favorite
0,1447537898572574730,2021-10-11 12:21:43,Open discussion. Between the Transfer Portal a...,Cheyenne Wyoming,0,0
1,1447540582490988553,2021-10-11 12:32:23,Plenty of things are changing in my life and t...,,0,0
2,1447807717859491842,2021-10-12 06:13:53,I feel a little hopeless. Anyone else? #hopele...,,0,0
3,1448076026219692033,2021-10-13 00:00:03,"Which is more healthy? Hope, or hopelessness? ...","Denver, CO",0,0
4,1448382047375040513,2021-10-13 20:16:04,So someone tell me how do I get over #HOPELESS...,Portland Or .,0,2
...,...,...,...,...,...,...
24142,1459521498842992642,2021-11-13 14:00:16,Just got banned from a server F #sad,Jakarta Capital Region,0,1
24143,1459521611997003777,2021-11-13 14:00:43,I literally cried during my exam and the cam i...,بيت أمك,0,0
24144,1459524263946326017,2021-11-13 14:11:15,No one can be happy with a guy like me. That's...,"Varanasi, Uttar Pradesh, India",0,0
24145,1459530315437785095,2021-11-13 14:35:18,arrived at my house but Am I Home? #deep #sad ...,they19sea,1,3


In [41]:
tweets_df.to_csv('/content/drive/MyDrive/NLP/Depression_Detection/Data_fetch_API/output/depressive_tweets.csv')

## Data cleaning

Data cleaning is an essential step: without it, errors in the data carry over into the analysis and into any data-driven results. Here I eliminate duplicate tweets using the primary key ('tweet.id'), check for empty rows, and replace "NaN" values where they occur.
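A minimal sketch of those three steps on a toy frame, assuming the column names used above. Treating a row without text as "empty" and using the placeholder string `"unknown"` for missing locations are my assumptions for illustration:

```python
import pandas as pd

# Toy data: one repeated tweet ID, one row without text, missing locations
toy = pd.DataFrame({
    "tweet.id": [1, 2, 2, 3],
    "text": ["a", "b", "b", None],
    "location": ["Toronto", None, None, "London"],
})

# 1. Drop repeated tweets, keeping the first row seen for each ID
cleaned = toy.drop_duplicates(subset="tweet.id", keep="first")

# 2. Remove rows that have no tweet text
cleaned = cleaned.dropna(subset=["text"])

# 3. Replace the remaining missing locations with a placeholder
cleaned = cleaned.fillna({"location": "unknown"}).reset_index(drop=True)

print(cleaned)
```

The same calls could be applied to `tweets_df` before it is written to depressive_tweets.csv.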

In [42]:
tweets_df.shape #Get number of rows and columns

(24147, 6)

In [43]:
## Check the data type of each column
tweets_df.dtypes.to_frame().rename(columns={0:'data_type'})

Unnamed: 0,data_type
tweet.id,int64
created_at,object
text,object
location,object
retweet,int64
favorite,int64


In [45]:
## Finding unique values in each column
for col in tweets_df:
    print("There are ", len(tweets_df[col].unique()), "unique values in ", col)

There are  18190 unique values in  tweet.id
There are  18071 unique values in  created_at
There are  17107 unique values in  text
There are  4648 unique values in  location
There are  74 unique values in  retweet
There are  159 unique values in  favorite
