### Introduction:
This project was last updated on 22 Oct 2022. Its aim is to analyze Twitter activity for one or more keywords.
For the targeted keyword, we collect the latest 10k tweets and then analyze them.
The project is done in two separate parts:
* Collecting Data
* Analyzing Data and Extracting Insights

To run this project, you need a Twitter developer account and a few Python packages, which are explained later in this notebook.
This project is done by Darian Ghorbanian.

### Objectives in this notebook (First Part to Collect Data):
* Defining our Keyword/s
* Defining our Query and Collect 10k Tweets Data
* Collecting Authors' Data

### Required Packages and Other Requirements:
As mentioned above, first create a Twitter developer account, and then create an app for the API, which gives you the keys for connecting.
If you don't know how to do so, please follow this link: https://developer.twitter.com/en/support/twitter-api/developer-account .
Once the app is created, the developer portal on the Twitter website shows its credentials: API Key, API Key Secret, and Bearer Token.

For data gathering, we need a few packages. Here is the list of all the packages you need for this part:

* tweepy
* json
* pandas

You most probably know json (part of the Python standard library, so it needs no installation) and pandas, but Tweepy may be new:
Tweepy (https://www.tweepy.org/) is an easy-to-use Python library for accessing the Twitter API.
To install it, run:

In [22]:
!pip install tweepy



## Gathering Tweets Data:
##### The next step is to gather data using this package and the API credentials you've created. So, we have:

In [2]:
import tweepy

Note that for collecting data we use a class built on top of tweepy.
This class is named TwitterCollector. The file containing its code must be in the same directory as this notebook; otherwise the import below fails. Once the file is in place, run the following code:

In [3]:
from TwitterCollector import TwitterCollector

Now it is time to collect your data. You need your bearer_token and an initialized TwitterCollector instance.
Note that 'bearer_token' must be the one you received from your Twitter developer account, so remember to change the code below:

In [4]:
bearer_token = r"YOUR_BEARER_TOKEN"  # replace with the Bearer Token from your developer portal
# initialize the collector:
tc = TwitterCollector(bearer_token=bearer_token)
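
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the token has been stored in an environment variable; the variable name `TWITTER_BEARER_TOKEN` and the helper `load_bearer_token` are hypothetical and not part of the original project:

```python
import os


def load_bearer_token(env_var="TWITTER_BEARER_TOKEN"):
    """Return the bearer token stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token
```

With this helper, the collector could be initialized as `tc = TwitterCollector(bearer_token=load_bearer_token())`, so the token never appears in the notebook itself.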

#### Deciding about the query:
Now you are ready to collect data. You only need to specify the query. In this project, we use the phrase "Mahsa Amini", which was one of the biggest trends on Twitter up to the end of September 2022.
We don't want tweets that match only "Mahsa" or only "Amini"; we want the exact phrase "Mahsa Amini".
Moreover, we exclude retweets so that they are not counted again, and we consider only English tweets, not Persian ones, although the majority of the tweets are in Persian.

In [5]:
query1= '"Mahsa Amini" -is:retweet lang:en'

If you want to know more about defining queries for your own interests, read the instructions here: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query
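
When several keywords are analyzed, writing each query string by hand is error-prone. The sketch below composes the same kind of query from its parts; `build_query` is a hypothetical helper introduced only for illustration, and it covers just the three operators used in this notebook (exact phrase, `-is:retweet`, `lang:`):

```python
def build_query(phrase, exclude_retweets=True, lang=None):
    """Compose a search query string from a phrase and optional filters."""
    parts = [f'"{phrase}"']            # double quotes ask for the exact phrase
    if exclude_retweets:
        parts.append("-is:retweet")    # leave retweets out
    if lang:
        parts.append(f"lang:{lang}")   # keep a single language only
    return " ".join(parts)


print(build_query("Mahsa Amini", lang="en"))
# prints: "Mahsa Amini" -is:retweet lang:en
```

The printed string is identical to `query1` above, so the helper can be swapped in without changing the collected data.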

To collect tweets, we use the method fetch_recent_tweets, whose inputs are the query, the number of tweets we want, and whether to save the results. So, we have:

In [7]:
recent_tweets_ma= tc.fetch_recent_tweets(query=query1, tweets_cnt=10000, save_result=True)

When you run this code, a json file is created in the folder where you run it. You need this file in the next steps for text mining.

Before moving on to text mining, we can take a closer look at the data collected in recent_tweets_ma:

In [8]:
type(recent_tweets_ma)

dict

In [9]:
recent_tweets_ma.keys()

dict_keys(['collection_type', 'collection_timestamp', 'query', 'tweet_cnt', 'tweets'])

We see that we have a dictionary with 5 keys. The last key, 'tweets', holds the data of the individual tweets. Let's have a look at it:

In [10]:
type(recent_tweets_ma['tweets'])

list

In [11]:
type(recent_tweets_ma['tweets'][0])

dict

So, we can see that 'tweets' is a list of dictionaries. It includes all of our 10000 tweets, and each member of this list is a dictionary with the details of one tweet.

In [12]:
recent_tweets_ma['tweets'][0].keys()

dict_keys(['entities', 'referenced_tweets', 'created_at', 'source', 'in_reply_to_user_id', 'lang', 'text', 'context_annotations', 'edit_history_tweet_ids', 'public_metrics', 'possibly_sensitive', 'id', 'author_id'])

The keys printed by the last cell are the pieces of information available for each tweet. Each of these keys may itself hold another dictionary.
Indeed, the json file we get from the Twitter API is a dictionary of dictionaries of ... of dictionaries.
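
Nested dictionaries like these are easier to inspect once they are flattened into a single level. A minimal sketch, assuming dotted key names are acceptable; `flatten` is a hypothetical helper and the values in `sample_tweet` are made up for illustration, not taken from the collected data:

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # descend into the nested dictionary, carrying the key prefix
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value  # lists and scalars are kept unchanged
    return flat


# Made-up sample with the same shape as one entry of recent_tweets_ma['tweets']
sample_tweet = {
    "id": "1",
    "author_id": "42",
    "public_metrics": {"retweet_count": 3, "like_count": 7},
}
flat_tweet = flatten(sample_tweet)
print(sorted(flat_tweet))
```

Each flattened tweet is a plain dictionary, so a list of them can be handed to pandas later to build a table with one column per key.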

But, as we saw, the collected data contains only the author_id and no further information about the author. So, for our holistic analysis, we need to collect that information as well.

## Gathering Authors' Data:
The only author data we have from the tweets is the author_id. So, we need to collect all the unique author_ids and then collect more data for each author.

In [13]:
# dict.fromkeys keeps the first occurrence of each id and preserves the original order
uniq_author_id = list(dict.fromkeys(tweet['author_id'] for tweet in recent_tweets_ma['tweets']))

In [28]:
print(type(uniq_author_id))
print(len(uniq_author_id))

<class 'list'>
3150


As we see, we have 3150 unique author ids. It means that the 10000 tweets we collected were written by 3150 different authors.
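
Since 10000 tweets come from 3150 authors, some authors must have tweeted more than once. A short sketch of how the tweets per author could be counted with `collections.Counter`; the helper name `tweets_per_author` and the `sample` list are hypothetical stand-ins for `recent_tweets_ma['tweets']`:

```python
from collections import Counter


def tweets_per_author(tweets):
    """Count how many tweets each author_id contributed."""
    return Counter(tweet["author_id"] for tweet in tweets)


# Made-up sample; in the notebook this would be recent_tweets_ma['tweets']
sample = [{"author_id": "a"}, {"author_id": "b"}, {"author_id": "a"}]
counts = tweets_per_author(sample)
print(len(counts))            # number of unique authors
print(counts.most_common(1))  # the most active author and their tweet count
```

The length of the counter equals the number of unique authors, so it also serves as a cross-check on `len(uniq_author_id)`.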

#### Automating Author Data Collection
To collect the authors' data, we need one fetch per author. The problem is that without a professional Twitter developer account, you can't fetch more than 300 times every 15 minutes. So, you have to collect the authors' data in rounds, waiting between them.
So, we need code like the following:

In [29]:
import time

author = []
for i, author_id in enumerate(uniq_author_id):
    try:
        author.append(tc.fetch_author_info(author_id))
    except AttributeError:  # if an author has removed their account, we skip it
        print('Author ' + str(i) + ' has removed their account')
    except Exception:  # any other error is assumed to be the rate limit (Too Many Requests)
        print('we need to wait some time!!!. We have collected ' + str(i-1) + ' authors data till now')
        time.sleep(5*60)  # when we reach the fetching limit, we wait 15 minutes in total
        print('we waited 5 minutes, need to wait more 10 minutes')
        time.sleep(5*60)
        print('we waited 10 minutes, need to wait another 5 minutes')
        time.sleep(4*60)
        print('we will start collect data again in one minute')
        time.sleep(60)
        print('Starting Again...')
        author.append(tc.fetch_author_info(author_id))  # retry the author who hit Too Many Requests

Author 241 has removed their account
we need to wait some time!!!. We have collected 281 authors data till now
we waited 5 minutes, need to wait more 10 minutes
we waited 10 minutes, need to wait another 5 minutes
we will start collect data again in one minute
Starting Again...
we need to wait some time!!!. We have collected 429 authors data till now
we waited 5 minutes, need to wait more 10 minutes
we waited 10 minutes, need to wait another 5 minutes
we will start collect data again in one minute
Starting Again...
we need to wait some time!!!. We have collected 729 authors data till now
we waited 5 minutes, need to wait more 10 minutes
we waited 10 minutes, need to wait another 5 minutes
we will start collect data again in one minute
Starting Again...
we need to wait some time!!!. We have collected 1029 authors data till now
we waited 5 minutes, need to wait more 10 minutes
we waited 10 minutes, need to wait another 5 minutes
we will start collect data again in one minute
Starting Aga

We gathered the authors' data in 12 rounds. As we see, one author has removed their account. This task took approximately 3 hours.
It is worth mentioning that with a professional Twitter developer account, you should not have to wait as long as 3 hours for a similar data set.
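
The waiting time can be estimated before starting the collection. The sketch below is a rough lower bound under the limit stated above (300 fetches per 15 minutes); `min_wait_minutes` is a hypothetical helper, and it ignores the time the requests themselves take:

```python
import math


def min_wait_minutes(n_requests, limit=300, window_minutes=15):
    """Lower bound on the total waiting time under a per-window request limit."""
    windows = math.ceil(n_requests / limit)   # windows needed to send all requests
    return max(windows - 1, 0) * window_minutes  # no wait is needed after the last window


print(min_wait_minutes(3150))  # prints 150
```

For 3150 authors this gives at least 150 minutes of waiting, which is in line with the roughly 3 hours observed once request time and the extra rounds are included.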

In [32]:
len(author)

3149

So, the number 3149 confirms that we collected data for all available authors: 3150 unique ids minus the one removed account.

The authors' data should be saved. We prefer to save it as a json file.

In [35]:
import json
with open('author_data.json','w') as f:
    json.dump(author,f)
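
After a collection that takes hours, it is worth checking that the file can be read back before closing the notebook. A minimal sketch of such a round-trip check; `save_and_reload`, the temporary file name, and the `sample_authors` record are made up for illustration and do not reflect the real structure returned by fetch_author_info:

```python
import json
import os
import tempfile


def save_and_reload(records, path):
    """Write records to a json file and read them back from disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up record; in the notebook this would be the `author` list
sample_authors = [{"id": "42", "username": "example_user"}]
path = os.path.join(tempfile.gettempdir(), "author_data_check.json")
reloaded = save_and_reload(sample_authors, path)
print(reloaded == sample_authors)  # prints True
```

The check only works if every item in the list is JSON-serializable (dictionaries, lists, strings, numbers); otherwise `json.dump` raises a `TypeError`.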

Our data is saved and the first part of our project is done.