## Identify a question that Twitter data (or data gathered from another API) could help you to answer. Figure out a query that would gather that data.

Explain the question you would like to answer, and why it is interesting.

Describe the data that you would like to gather, the analysis that you would like to perform on it, and why it would answer the question.



#### Is there a relationship between the number of followers a person has, or the number of people they follow, and the number of tweets they post per day?

#### To answer this, I will gather the follower count, the friend (following) count, and the number of tweets posted for a set of users. Then I will compute the correlation between each pair of these counts to see whether there is a relationship.
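
The question is about tweets per day, while the `statuses_count` field returned by Twitter is a lifetime total. One possible way to bridge the two, sketched here as an assumption rather than as part of the original analysis, is to divide the total by the account's age, taken from the `created_at` field of the user object. The helper name `tweets_per_day` and the example account are hypothetical.

```python
from datetime import datetime, timezone

# Timestamp format used for `created_at` in Twitter's v1.1 JSON.
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def tweets_per_day(user, now=None):
    """Average tweets per day over the lifetime of an account.

    `user` is a dict shaped like the `user` object inside a tweet's JSON.
    This helper is hypothetical and not part of the original notebook.
    """
    now = now or datetime.now(timezone.utc)
    created = datetime.strptime(user['created_at'], TWITTER_TIME_FORMAT)
    # Treat accounts younger than one day as one day old to avoid dividing by zero.
    age_in_days = max((now - created).total_seconds() / 86400, 1.0)
    return user['statuses_count'] / age_in_days


# Made-up account: 3,650 tweets posted over exactly 365 days.
example_user = {'created_at': 'Wed Jan 01 00:00:00 +0000 2020',
                'statuses_count': 3650}
rate = tweets_per_day(example_user,
                      now=datetime(2020, 12, 31, tzinfo=timezone.utc))
print(rate)
```

A lifetime average hides changes in activity over time, so it is only a rough stand-in for how often someone tweets today.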

## Gathering Data

Save the raw data that Twitter returns in a .json file.

Write code that filters the JSON data into what you are interested in and saves it as a CSV file.


In [2]:
import tweepy
from IPython.display import display, Image  # This line lets you display images. We'll use that in a bit.
import requests
import json
from twitter_authentication import CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth, wait_on_rate_limit=True)

In [3]:
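tweet_list = []
seen_users = set()  # screen names already collected, so each user is kept only once
# Note: items(1) fetches a single tweet; raise this number to collect enough users for a correlation.
the_tweets = tweepy.Cursor(api.search, 'geocode:39.83,-98.58,10000km').items(1)
for tweet in the_tweets:
    if tweet.user.screen_name not in seen_users:
        seen_users.add(tweet.user.screen_name)
        tweet_list.append(tweet._json)
# Save the raw JSON exactly as Twitter returned it
with open('my twitter data', 'w') as file:
    json.dump(tweet_list, file)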

    

In [None]:
with open('my twitter data', 'r') as file:
    tweets = json.load(file)
    


In [None]:
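import csv

# newline='' stops the csv module from writing blank rows between records on Windows
with open('twitter data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['followers_count', 'friends_count', 'statuses_count'])
    # Keep only the three per-user counts needed for the analysis
    for tweet in tweets:
        user = tweet['user']
        writer.writerow([user['followers_count'],
                         user['friends_count'],
                         user['statuses_count']])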



## Analysis

Come up with a visualization and/or statistical test that would help to answer your question.

Create that visualization or run that statistical test.

In [None]:
import pandas as pd
import seaborn as sns
from scipy import stats
import numpy as np
df = pd.read_csv('./twitter data.csv')
df2 = pd.read_csv('./twitter data outliers removed.csv')
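
The notebook does not show how `twitter data outliers removed.csv` was produced. The sketch below is one possible way to build such a file with pandas, assuming the outliers are simply the three rows with the largest `followers_count`. The helper name `drop_largest`, the choice of column, and the sample values are all hypothetical, and the author may have chosen the outliers differently.

```python
import pandas as pd


def drop_largest(frame, column, n=3):
    """Return a copy of `frame` without the `n` rows that have the largest `column` values."""
    return frame.drop(frame.nlargest(n, column).index)


# Made-up counts, for illustration only.
sample = pd.DataFrame({
    'followers_count': [120, 340, 95, 2_000_000, 410, 5_500_000, 280, 900_000],
    'friends_count':   [200, 310, 80, 150, 390, 40, 260, 1_000],
    'statuses_count':  [1500, 4200, 300, 9000, 5100, 12000, 2700, 30000],
})

trimmed = drop_largest(sample, 'followers_count', n=3)
print(trimmed)

# To write the result next to the original CSV, something like this would work:
# trimmed.to_csv('twitter data outliers removed.csv', index=False)
```

Doing the removal in code, rather than by editing the CSV by hand, keeps a record of which rows were dropped and why.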

In [None]:

sns.lmplot(x = 'followers_count',
           y = 'statuses_count',
           height = 5, aspect = 1.25,
           data=df)
sns.lmplot(x = 'friends_count',
           y = 'statuses_count',
           height = 5, aspect = 1.25,
           data=df)
sns.lmplot(x = 'friends_count',
           y = 'followers_count',
           height = 5, aspect = 1.25,
           data=df)

In [None]:
# Pearson correlation coefficient and p-value for each pair of columns in the full data set
foscorr, fosp = stats.pearsonr(df['followers_count'], df['statuses_count'])
frscorr, frsp = stats.pearsonr(df['friends_count'], df['statuses_count'])
fofrcorr, fofrp = stats.pearsonr(df['followers_count'], df['friends_count'])
print(f"There is a {foscorr} correlation between the number of followers and the number of statuses, p = {fosp}")
print(f"There is a {frscorr} correlation between the number of friends and the number of statuses, p = {frsp}")
print(f"There is a {fofrcorr} correlation between the number of followers and the number of friends, p = {fofrp}")
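
For readers who want to see what the coefficient returned by `stats.pearsonr` measures, here is a minimal sketch that computes Pearson's r directly with NumPy. The function name `pearson_r` and the sample numbers are made up for illustration. The second call shows how a single extreme account can flip the sign of the coefficient, which is the motivation for also looking at the data with outliers removed.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: the covariance of x and y scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()  # deviations from the mean
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())


# Made-up counts in which statuses rise exactly in step with followers.
followers = [10, 20, 30, 40, 50]
statuses = [100, 200, 300, 400, 500]
r_linear = pearson_r(followers, statuses)

# The same data plus one hypothetical account with a huge following and few tweets.
r_outlier = pearson_r(followers + [1_000_000], statuses + [50])

print(r_linear, r_outlier)
```

The first value is 1 up to floating-point rounding, and the second is negative even though five of the six points lie on a rising line.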

In [None]:
sns.lmplot(x = 'followers_count',
           y = 'statuses_count',
           height = 5, aspect = 1.25,
           data=df2)
sns.lmplot(x = 'friends_count',
           y = 'statuses_count',
           height = 5, aspect = 1.25,
           data=df2)
sns.lmplot(x = 'friends_count',
           y = 'followers_count',
           height = 5, aspect = 1.25,
           data=df2)

# Recompute the correlations on the data set with the outliers removed
foscorr2, fosp2 = stats.pearsonr(df2['followers_count'], df2['statuses_count'])
frscorr2, frsp2 = stats.pearsonr(df2['friends_count'], df2['statuses_count'])
fofrcorr2, fofrp2 = stats.pearsonr(df2['followers_count'], df2['friends_count'])
# Report the df2 results (the "2" variables), not the values from the full data set
print(f"If we remove three major outliers, there is now a {foscorr2} correlation between the number of followers\n"
      f"and the number of statuses, p = {fosp2}, a {frscorr2} correlation between the number of friends\n"
      f"and the number of statuses, p = {frsp2}, and a {fofrcorr2} correlation between the number of\n"
      f"followers and the number of friends, p = {fofrp2}")