<img src="../../../images/banners/pandas-cropped.jpeg" width="600"/>

<a class="anchor" id="intro_to_data_structures"></a>
# <img src="../../../images/logos/pandas.png" width="23"/> DataFrame Mini Project: Twitter Data

## <img src="../../../images/logos/toc.png" width="20"/> Table of Contents
* [Required Libraries](#required-libraries)
* [Connect to the API](#connect-api)
* [Call the API](#call-api)
* [Read Dumped Data](#read-dumped-data)

---

<a class="anchor" id="required-libraries"></a>
# Required Libraries

Twitter provides an API for retrieving tweets and user data. This notebook uses the tweepy library to call it.

In [1]:
!pip install tweepy



In [2]:
import tweepy

In [3]:
from pathlib import Path
import pandas as pd
import json
import os
from tqdm import tqdm

<a class="anchor" id="connect-api"></a>
# Connect to the API

To use the Twitter API, you first need a developer account. See the [getting-access guide](https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api) to learn how to get one.

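Once you have an account, you can generate the keys and access tokens needed to authenticate and call the API. The cell below reads them from environment variables, which keeps the secrets out of the notebook itself.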

In [4]:
CONSUMER_KEY = os.environ['CONSUMER_KEY']
CONSUMER_SECRET = os.environ['CONSUMER_SECRET']
ACCESS_TOKEN = os.environ['ACCESS_TOKEN']
ACCESS_TOKEN_SECRET = os.environ['ACCESS_TOKEN_SECRET']
BEARER_TOKEN = os.environ['BEARER_TOKEN']
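
Indexing `os.environ` directly raises a bare `KeyError` on the first variable that is missing. As a minimal sketch, the check below reports every missing variable at once. The helper name `load_credentials` and the dummy values are assumptions made for illustration; they are not part of tweepy or of the original notebook.

```python
import os

REQUIRED_VARS = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
    "BEARER_TOKEN",
)


def load_credentials(env=os.environ, names=REQUIRED_VARS):
    """Return a dict of credentials, or raise one error naming every missing variable."""
    # Treat unset and empty variables the same way.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Demonstration with a stand-in mapping instead of the real environment.
fake_env = {name: f"dummy-{name.lower()}" for name in REQUIRED_VARS}
credentials = load_credentials(fake_env)
print(sorted(credentials))
```

In the notebook itself you would call `load_credentials()` with no arguments so that it reads the real environment.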

In [6]:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

<a class="anchor" id="call-api"></a>
# Call the API

In [25]:
dump_path = Path('./data/twitter_data/')
# Create the dump directory (and ./data if needed); do nothing if it already exists.
dump_path.mkdir(parents=True, exist_ok=True)

for page in tqdm(tweepy.Cursor(
    api.search_tweets,
    tweet_mode='extended',
    q = "#machine_learning",
    count = 10,
    # lang="en",
).pages(2)):
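    for tweet in page:
        # tweet._json is the raw API payload as a plain dict.
        json_data = tweet._json
        # One file per tweet, named by tweet id, so re-runs overwrite rather than duplicate.
        with open(dump_path / f'{json_data["id"]}.json', 'w') as f:
            json.dump(json_data, f)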

2it [00:00,  3.78it/s]
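
The dump step can be tried without API access. The sketch below repeats the same write pattern on two stand-in dictionaries and reads one file back to confirm the round trip. The payloads and the temporary directory are assumptions made for illustration; real payloads come from `tweet._json` and carry many more fields.

```python
import json
import tempfile
from pathlib import Path

# Stand-in payloads (hypothetical); only the fields needed for the demo.
fake_tweets = [
    {"id": 101, "full_text": "first example #machine_learning"},
    {"id": 102, "full_text": "second example #machine_learning"},
]

with tempfile.TemporaryDirectory() as tmp:
    dump_path = Path(tmp) / "data" / "twitter_data"
    dump_path.mkdir(parents=True, exist_ok=True)

    # Same pattern as the notebook: one JSON file per tweet, named by id.
    for json_data in fake_tweets:
        with open(dump_path / f'{json_data["id"]}.json', "w") as f:
            json.dump(json_data, f)

    written = sorted(p.name for p in dump_path.iterdir())
    reloaded = json.loads((dump_path / "101.json").read_text())

print(written)
```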


<a class="anchor" id="read-dumped-data"></a>
# Read Dumped Data

In [26]:
DATA_DIR = Path('./data/twitter_data')

In [27]:
def read_json(file_path):
    with open(file_path) as f:
        return json.load(f)

In [28]:
rows = []
for file_path in tqdm(DATA_DIR.iterdir()):
    if file_path.is_dir():
        continue
    d = read_json(file_path)
    
    rows.append(dict(
        name = d['user']['name'],
        followers = d['user']['followers_count'],
        following = d['user']['friends_count'],
        follower_following_ratio = d['user']['followers_count'] / (d['user']['friends_count'] + 1),
        text = d.get('full_text') or d.get('text'),
        hashtags = [item['text'] for item in d['entities']['hashtags']],
        likes = d['favorite_count'],
        retweets = d['retweet_count'],
    ))

20it [00:00, 3927.62it/s]
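
To see what a single row looks like, the flattening step can be run on one hand-written payload. This is a sketch, not the notebook's own code: the function name `tweet_to_row` is hypothetical, the user counts are taken from the first row of the output shown later, and the tweet text is invented. The `+ 1` in the denominator presumably guards against division by zero for accounts that follow nobody.

```python
def tweet_to_row(d):
    """Flatten one tweet payload (a dict) into the columns used by the notebook."""
    user = d["user"]
    return dict(
        name=user["name"],
        followers=user["followers_count"],
        following=user["friends_count"],
        # +1 keeps the ratio defined when friends_count is 0.
        follower_following_ratio=user["followers_count"] / (user["friends_count"] + 1),
        # Extended-mode payloads use 'full_text'; fall back to 'text' otherwise.
        text=d.get("full_text") or d.get("text"),
        hashtags=[item["text"] for item in d["entities"]["hashtags"]],
        likes=d["favorite_count"],
        retweets=d["retweet_count"],
    )


# Hand-written stand-in payload; the text is invented for the demo.
sample = {
    "user": {"name": "Coding Buddy", "followers_count": 9, "friends_count": 2},
    "full_text": "Example tweet text #30Daysofcode",
    "entities": {"hashtags": [{"text": "30Daysofcode"}]},
    "favorite_count": 0,
    "retweet_count": 5,
}

row = tweet_to_row(sample)
print(row["follower_following_ratio"])  # 9 / (2 + 1) = 3.0
```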


In [29]:
df = pd.DataFrame(rows)

In [30]:
pd.set_option('display.min_rows', 20)
pd.set_option('display.max_colwidth', 200)

In [31]:
df

,name,followers,following,follower_following_ratio,text,hashtags,likes,retweets
0,Coding Buddy,9,2,3.0,RT @lucifer_twtt: #30DaysOfCodechallenge\nDay22\nNot done much today. Just started eda for new project\n#30Daysofcode challenge to train my mo…,"[30DaysOfCodechallenge, 30Daysofcode]",0,5
1,Unemployed Professor,29,176,0.163842,Let's handle your;\n#Homework \n#Machine_Learning \n#Data_Science\n#Assignments\n#Stats \n#Fall_classes\n#Pearsons\n#Stats_Class\n#Finals\n#Python\n#R_programming_Language\n#Stata\n#Spss\n#JavaScr...,"[Homework, Machine_Learning, Data_Science, Assignments, Stats, Fall_classes, Pearsons, Stats_Class, Finals, Python, R_programming_Language, Stata, Spss, JavaScript, We_deliver]",0,2
2,Akintayo,68,36,1.837838,RT @stats_helpers: We can complete your;\n#Homework \n#Machine_Learning \n#Data_Science\n#Assignments\n#Stats \n#Fall_classes\n#Pearsons\n#Stats_Cl…,"[Homework, Machine_Learning, Data_Science, Assignments, Stats, Fall_classes, Pearsons]",0,2
3,ام محمد,35,87,0.397727,RT @AshwagAlbukhari: 📍📍📍\nhappening now as part of “#Ai for #precision_medicine” student program @UniofOxford \n\n“Introduction to coding in #…,"[Ai, precision_medicine]",0,10
4,Mr Data Scientist,10965,270,40.461255,RT @lucifer_twtt: #30DaysOfCodechallenge\nDay22\nNot done much today. Just started eda for new project\n#30Daysofcode challenge to train my mo…,"[30DaysOfCodechallenge, 30Daysofcode]",0,5
5,#30DaysOfCode,2318,1,1159.0,RT @lucifer_twtt: #30DaysOfCodechallenge\nDay22\nNot done much today. Just started eda for new project\n#30Daysofcode challenge to train my mo…,"[30DaysOfCodechallenge, 30Daysofcode]",0,5
6,SUPER WRITERS,178,441,0.402715,We can complete your;\n#Homework \n#Machine_Learning \n#Data_Science\n#Assignments\n#Stats \n#Fall_classes\n#Finals\n#Pearson\n#Python\n#R_programming_Language\n#Stata\n#Spss\n#JavaScript\nGet Qui...,"[Homework, Machine_Learning, Data_Science, Assignments, Stats, Fall_classes, Finals, Pearson, Python, R_programming_Language, Stata, Spss, JavaScript, We_deliver]",0,2
7,Xeron Bot,2309,1,1154.5,RT @Tutor_Nolan: Let's handle your;\n#Homework \n#Machine_Learning \n#Data_Science\n#Assignments\n#Stats \n#Fall_classes\n#Pearsons\n#Stats_Class\n#…,"[Homework, Machine_Learning, Data_Science, Assignments, Stats, Fall_classes, Pearsons, Stats_Class]",0,2
8,Utibe-Abasi Jacob Udoh,423,888,0.475816,RT @Tutor_Nolan: Let's handle your;\n#Homework \n#Machine_Learning \n#Data_Science\n#Assignments\n#Stats \n#Fall_classes\n#Pearsons\n#Stats_Class\n#…,"[Homework, Machine_Learning, Data_Science, Assignments, Stats, Fall_classes, Pearsons, Stats_Class]",0,2
9,Lonewollff🥑,539,3479,0.154885,@datawithsuman @SaveToNotion #machine_learning,[machine_learning],0,0
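
Because `hashtags` holds a list in every row, a natural next step is to count how often each hashtag appears. The sketch below uses `DataFrame.explode` and `Series.value_counts` on three rows copied from the output above; the name `df_small` and the choice of rows are assumptions made for illustration.

```python
import pandas as pd

# Three rows copied from the output above, keeping only two columns.
df_small = pd.DataFrame(
    [
        {"name": "Coding Buddy",
         "hashtags": ["30DaysOfCodechallenge", "30Daysofcode"]},
        {"name": "Akintayo",
         "hashtags": ["Homework", "Machine_Learning", "Data_Science",
                      "Assignments", "Stats", "Fall_classes", "Pearsons"]},
        {"name": "Mr Data Scientist",
         "hashtags": ["30DaysOfCodechallenge", "30Daysofcode"]},
    ]
)

# explode() gives each list element its own row; value_counts() tallies them.
hashtag_counts = df_small.explode("hashtags")["hashtags"].value_counts()
print(hashtag_counts)
```

The same two calls work on the full `df` built earlier, since its `hashtags` column has the same list-per-row shape.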
