# Are we what we tweet?
Nadine Ruecker <br>
Final Capstone Thinkful <br>
March - April 2019
## Introduction
Overweight and obesity are complex health issues that are affected by genetics, behavior and society. Eating patterns, exercise morale and even the use of medication seem to be voluntary choices. Nevertheless, studies have shown that these choices are not made entirely freely but are shaped by the social environment, especially with regard to eating and exercise habits, education, and reactions to advertising and marketing strategies. Obesity places an enormous financial burden on the health care system, as it is associated with many other diseases such as diabetes, cardiovascular diseases (heart disease and hypertension), mental illness, osteoarthritis and even some forms of cancer. Furthermore, overall productivity suffers from increased rates of absenteeism (being absent) and presenteeism (being present but unproductive) at work, and from an earlier onset of disabilities (https://www.cdc.gov/obesity/adult/causes.html). In 2016, 64.8% of adults in the US were reported to be overweight. <br>
In this project, I explore whether signs of healthy and unhealthy population behavior can be observed in Twitter data. I focus on the following two questions: <br>
1. Can the number of tweets in different categories predict health outcomes? <br>
2. Is the language of tweets mentioning healthy behavior different from that of tweets mentioning unhealthy behavior?
## Hypothesis
How many people are having a burger? How often do they go to a healthy restaurant? Who is sitting on the couch watching TV and who is exercising? <br>
Twitter generates an enormous amount of behavioral data in the form of tweets, retweets, followers and likes. But can this data be representative of the population? My hypothesis is that tweet content reflects the eating and exercise behavior of the average population per State. I would like to predict how healthy the population of a given US State is without using data generated by health institutions (hospitals, insurers, central data collections), relying on tweets instead.
## Data
The project is based on two independent data sets: health indicators and features engineered from scraped tweets.
### Health indicating data
The data on health indicators originates from KFF (https://www.kff.org/). I collected data for the years 2013-2016 for these indicators: diabetes rate, ER visits, hospital stays, inpatient days, obesity rates and self-estimated health, with values for each US State. For the modeling, these indicators are combined into the output variable Combi_Indi (combined indicators) and further converted into a categorical variable ('Health_Cat'). High values of Combi_Indi indicate bad health outcomes (high obesity rates, many hospital stays...). Health_Cat classifies the Combi_Indi values into 4 categories: A: very good (below -0.5), B: good (-0.5 to 0), C: bad (0 to 0.5) and D: very bad (above 0.5).
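
The conversion of Combi_Indi into Health_Cat can be sketched as follows. This is a minimal illustration rather than the code from the project notebooks: the example values and State labels are made up, and how values that fall exactly on a threshold are assigned is an assumption.

```python
import pandas as pd

# Hypothetical Combi_Indi values; the real ones are derived from the KFF indicators.
combi = pd.DataFrame({
    "State": ["S1", "S2", "S3", "S4"],
    "Combi_Indi": [-0.8, -0.2, 0.3, 0.9],
})

# Bin edges follow the thresholds given in the text.
# Assumption: intervals are right-inclusive, e.g. exactly -0.5 falls into A.
bins = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
combi["Health_Cat"] = pd.cut(combi["Combi_Indi"], bins=bins, labels=["A", "B", "C", "D"])

print(combi)
```
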
### Tweet data - Initial results
To classify tweets into healthy and unhealthy behavior, I did a preliminary analysis, scraping tweets using the standard Twitter API with tweepy as a wrapper. I sampled 2000 tweets per query for around 200 queries covering healthy and unhealthy activities, food and restaurants (https://github.com/NaRuecker/Final-Capstone/blob/master/Tweepy%20Extraction%20and%20Location%20determination%20-%20NoKeys.ipynb). Between 17% and 20% of the tweets could be assigned to a US State using the location saved in the user's profile. The US-based tweets were then grouped by query and summarized into counts for each US State (e.g. 20 tweets mentioning yoga in NY). Tweet counts were then normalized to account for unequal sampling and for the population size of the respective State. A first preliminary analysis revealed that the normalized counts have no correlation with the outcome variable. Only the ratios between unhealthy and healthy tweet counts correlated with the outcome. However, this was only true for the tweet ratios of food queries. The count ratios for activities and restaurants did not show any correlation with the health indicators.
![TweetCountCorrelation](Images/Combi_Health correlation.png)
Therefore, I did not pursue activities and restaurants any further.
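
The counting and normalization step described above could look roughly like this. The exact normalization used in the project is not spelled out here, so the formula below (share of each query's located tweets, divided by State population) is an assumption, as are the column names and the population figures.

```python
import pandas as pd

# Hypothetical located tweets: one row per tweet with its query and State.
tweets = pd.DataFrame({
    "query": ["yoga", "yoga", "yoga", "burger", "burger", "burger", "burger"],
    "state": ["NY", "NY", "TX", "NY", "TX", "TX", "TX"],
})
# Hypothetical population figures in millions (not real census data).
population_m = pd.Series({"NY": 20.0, "TX": 30.0})

# Count tweets per State (rows) and query (columns).
counts = tweets.groupby(["state", "query"]).size().unstack(fill_value=0)

# Assumed normalization, step 1: correct for unequal sampling by dividing
# by the number of located tweets per query.
share = counts / counts.sum(axis=0)

# Assumed normalization, step 2: divide by State population.
normalized = share.div(population_m, axis=0)

print(normalized)
```
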
### Historical tweet data - Constraints and work arounds
Historical data (beyond the past 7 days) is not accessible using the standard Twitter API. In order to scrape tweets matching the health data for the years 2013-2016, I had two options: using the Twitter premium API or finding a web scraping workaround, possibly through the advanced tweet search. <br> A Twitter premium account is costly: prices start at $399/month, which allows full archive access. But even the premium account is limited to 500 requests/month, and the maximum number of tweets returned per request is 500. As I was aiming to scrape ~100 queries for 4 years of data, each returning at least 2000 tweets, I would have needed at least 1,600 requests (2000 × 100 × 4 = 800,000 tweets; 800,000 / 500 = 1,600 requests). The premium API was therefore not an option for this project. <br>
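
The request estimate can be checked with a few lines of arithmetic. The sketch assumes that every request returns the full 500 tweets, so the result is a lower bound; the variable names are illustrative.

```python
tweets_per_query_per_year = 2000
queries = 100
years = 4
max_tweets_per_request = 500
monthly_request_limit = 500

total_tweets = tweets_per_query_per_year * queries * years
# Ceiling division: a partially filled request still counts as a request.
requests_needed = -(-total_tweets // max_tweets_per_request)
months_needed = requests_needed / monthly_request_limit

print(total_tweets)     # 800000
print(requests_needed)  # 1600
print(months_needed)    # 3.2
```
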
Finally, I decided to use the twint package to access historical data without authentication and rate limits. The remaining limitation was that I could not use twint to access the users' locations, as the user data call did not work for me (an issue related to the Windows operating system, https://github.com/twintproject/twint/issues/384). So I scraped the tweets using twint (https://github.com/NaRuecker/Final-Capstone/blob/master/TwintHistoricTweets.ipynb) and then looked up the users by their user ID with the standard Twitter API to obtain their locations (https://github.com/NaRuecker/Final-Capstone/blob/master/Tweepy%20Extraction%20and%20Location%20determination%20-%20noKeys.ipynb).



## Analysis and Results
### Can tweet counts in diverse categories predict health outcomes?
I tried three different general approaches to predict Combi_Indi:
1. Normalized tweet counts per query, summarized using PCA.
2. Normalized tweet count ratios combined with regional data and summarized using PCA.
3. Features that were most correlated with the outcome variable. <br>

First, I used the normalized tweet counts for each query, combined their shared variance using PCA and used the top 10 components as input for various models. I performed a grid search for each model, but the modeling scores were disappointing. <br>
As the single query counts were too noisy to have any predictive value, I summarized them into healthy and unhealthy total counts. I also calculated the ratio of unhealthy over healthy normalized counts (UnH_ratio). To further categorize the States, I added regional and divisional data (https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States). As the maps below show, there are clear regional trends, for example for the West Coast and the southern States.
![TwoMaps](Images/Two_Maps.png)
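
A minimal sketch of the feature engineering described above: the UnH_ratio and one-hot encoded regions and divisions. The column names `healthy_norm` and `unhealthy_norm` and all numbers are hypothetical; only the resulting names such as `Region_South` follow the naming used in this project.

```python
import pandas as pd

# Hypothetical normalized total counts per State.
df = pd.DataFrame({
    "State": ["CA", "MS", "VT"],
    "healthy_norm": [0.50, 0.20, 0.40],
    "unhealthy_norm": [0.25, 0.40, 0.20],
    "Region": ["West", "South", "Northeast"],
    "Division": ["Pacific", "EastSouthCentral", "NewEngland"],
})

# Ratio of unhealthy over healthy normalized counts.
df["UnH_ratio"] = df["unhealthy_norm"] / df["healthy_norm"]

# One-hot encode region and division, yielding columns like Region_South.
features = pd.get_dummies(df, columns=["Region", "Division"])

print(features.columns.tolist())
```
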
As there was some collinearity between the different divisions and regions, I used PCA to combine features (see next figure, A). The top 2 components explained more than 99% of the variance (B). Both components showed a decent amount of correlation with the outcome feature (C).
![CorrMa and pcas](Images/CorrMa_Pcas.png)
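
Combining collinear features with PCA and reading off the explained variance can be sketched as below. The data is synthetic, and the standardization step before PCA is an assumption; the project notebooks may have prepared the features differently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 collinear columns built from 2 underlying signals plus small noise.
base = rng.normal(size=(60, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4))]) + 0.01 * rng.normal(size=(60, 6))

# Assumption: features are standardized before PCA.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

# Share of the total variance captured by the two components.
explained = pca.explained_variance_ratio_.sum()
print(pca.explained_variance_ratio_, explained)
```
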
I used the two components as input features for several models. All models were run using a grid search to optimize their parameters and a 4-fold cross validation. The resulting R^2 values still showed high standard deviations and were partially negative.
![PCA Modeling Results](Images/PcaModel.JPG)
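
The grid search with 4-fold cross validation and R^2 scoring can be sketched as follows. Ridge regression and its parameter grid are placeholders chosen for brevity, not necessarily one of the models used in the project, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the two PCA components and the Combi_Indi outcome.
X = rng.normal(size=(80, 2))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=80)

# Placeholder model and grid; cv=4 and R^2 scoring follow the text.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=4, scoring="r2")
search.fit(X, y)

# Mean and standard deviation of R^2 across the 4 folds for the best setting.
best = search.best_index_
mean_r2 = search.cv_results_["mean_test_score"][best]
std_r2 = search.cv_results_["std_test_score"][best]
print(search.best_params_, mean_r2, std_r2)
```

With a small dataset, the standard deviation across folds is often large relative to the mean, which is the pattern described above.
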
Surprisingly, the best model results were achieved when using the 8 features with the highest correlation to the outcome variable: 'UnH_ratio', 'Region_South', 'Region_West', 'Division_EastSouthCentral', 'Division_Mountain', 'Division_Pacific', 'Division_WestSouthCentral', 'Division_NewEngland'.
![Handselected Features Modeling Results](Images/HandSelectedModel.JPG)
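
Selecting the features with the highest absolute correlation to the outcome can be sketched like this. The data and the feature names `feat_strong`, `feat_negative` and `feat_noise` are synthetic and purely illustrative; the sketch keeps the top 2 instead of the top 8.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
target = rng.normal(size=n)

# Synthetic features with strong, moderate negative and no relation to the outcome.
data = pd.DataFrame({
    "Combi_Indi": target,
    "feat_strong": target + 0.1 * rng.normal(size=n),
    "feat_negative": -target + 0.5 * rng.normal(size=n),
    "feat_noise": rng.normal(size=n),
})

# Absolute Pearson correlation of every feature with the outcome.
corr = data.corr()["Combi_Indi"].drop("Combi_Indi").abs()

# Keep the k most correlated features (k=2 here, 8 in the text).
top_features = corr.sort_values(ascending=False).head(2).index.tolist()
print(top_features)
```

One caveat of this kind of selection: when the correlations are computed on the full dataset before cross validation, the held-out folds have already influenced the feature choice, so the resulting scores can be optimistic.
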
I further tried to optimize the modeling using neural networks with sklearn and keras, but even in combination with grid search the resulting R^2 values were negative or close to zero. The very poor performance of the neural networks is probably due to the small size of the dataset (180 data points, 8 features).


### Is the language of tweets mentioning healthy behavior different from that of tweets mentioning unhealthy behavior?
In the first part of the project, I analyzed the occurrence of queries in tweets and classified them as 'healthy' or 'unhealthy' based on the search query. Per query I scraped around 8000 tweets (2000 per year), but for the NLP analysis I only kept the tweets that could be assigned to a US State. Tweets with the same text were removed in order to reduce the impact of retweets. As a result, fewer than 1500 tweets per query remained. I was not sure whether this would be enough, and unbiased enough, to build a good training set for a natural language processing approach that differentiates healthy from unhealthy tweets.<br>
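
Removing tweets with identical text can be sketched in pandas as below. The example tweets and column names are hypothetical.

```python
import pandas as pd

# Hypothetical located tweets; the second row repeats the text of the first.
tweets = pd.DataFrame({
    "text": ["Love my morning yoga", "Love my morning yoga", "Burger time!", "RT Burger time!"],
    "state": ["NY", "CA", "TX", "TX"],
})

# Keep only the first occurrence of each distinct text.
unique_tweets = tweets.drop_duplicates(subset="text", keep="first").reset_index(drop=True)

print(len(tweets), len(unique_tweets))
```

Note that this only removes exact matches: a retweet that carries a prefix such as "RT" has a different text and is kept.
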
After generating a tf-idf matrix and performing latent semantic analysis, it became clear that the data I collected could be classified to some extent by the queries I had chosen.
![TweetsExamples](Images/LSAs.png)
In order to determine the general category of a tweet (unhealthy or healthy), I decided to use NLP. I used three different methods of feature generation:
1. tf-idf followed by SVD for latent semantic analysis (LSA)
2. tf-idf followed by latent Dirichlet allocation
3. tf-idf followed by non-negative matrix factorization

From each method I used the top 300 features as input for 6 different models, all with grid search for parameter optimization and 3-fold cross validation.
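
The first feature generation method, tf-idf followed by SVD, can be sketched with scikit-learn as follows. The tiny example corpus is made up, the English stop word removal is an assumption, and only 2 components are kept here because the corpus is so small (the project used the top 300).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up example tweets standing in for the scraped corpus.
docs = [
    "morning yoga class before work",
    "went for a long run in the park",
    "fresh salad and smoothie for lunch",
    "double cheeseburger and fries tonight",
    "pizza and soda on the couch",
    "watching tv with a bag of chips",
]

# tf-idf matrix followed by truncated SVD, i.e. latent semantic analysis.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # assumption: English stop words removed
    TruncatedSVD(n_components=2, random_state=0),
)
lsa_features = lsa.fit_transform(docs)

print(lsa_features.shape)
```
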
The best results were achieved using the tf-idf LSA data. Scores ranged in the high 0.9s, with small standard deviations. Interestingly, the similarities found in the tweets are distinct from the features I generated using the queries.
![Tfidf-svdModel data](Images/Tfidf-svd-Model.png)
The NLP scripts can be found here (https://github.com/NaRuecker/Final-Capstone/blob/master/Final_Capstone_NLP_Model.ipynb).

### Predicting Health_Cat using tweet language attributes.
When trying to predict the absolute value of Combi_Indi using tweet counts by category, the models had a high MSE and highly variable scores. I attribute this in particular to the size of the dataset (188, 8). This motivated me to try a different approach: first I converted Combi_Indi into a new categorical feature 'Health_Cat' with 4 different categories: A (very good health indicator values), B (good health indicator values), C (bad health indicator values) and D (very bad health indicator values).
![Combi_Indi_Hist](Images/Combi_Id_Hist.png)
I joined the health category data and the regional data with the LSA data. Now, instead of having one data point per year and State, I had thousands of tweets representing each health_indicator measurement per year and State. As the classes were unbalanced, I decided to downsample the dataset, so that the category with the lowest number of tweets (A) determined the number of tweets used from each category. This still resulted in a much bigger dataset (11820, 363).<br>
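
Downsampling every category to the size of the smallest one can be sketched as below. The toy data frame and the column `lsa_0` are hypothetical; in the project the smallest category was A.

```python
import pandas as pd

# Hypothetical tweet-level data with unbalanced health categories.
tweets = pd.DataFrame({
    "Health_Cat": ["A"] * 2 + ["B"] * 5 + ["C"] * 4 + ["D"] * 3,
    "lsa_0": range(14),
})

# Size of the smallest category.
n_min = int(tweets["Health_Cat"].value_counts().min())

# Draw n_min random rows from every category.
balanced = tweets.groupby("Health_Cat").sample(n=n_min, random_state=0)

print(balanced["Health_Cat"].value_counts())
```
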
I tried several models to predict the 4 categories, each with 3-fold cross validation and grid search to optimize the model's parameters. The strongest performance resulted from K-nearest neighbors with an accuracy of 0.948, and as the confusion matrix below indicates, it had excellent precision for all 4 categories.
![Classification Results](Images/Class_Results.png)
The script can be found here (https://github.com/NaRuecker/Final-Capstone/blob/master/Classification%20of%20Combi_Indi.ipynb).
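
A minimal sketch of a K-nearest neighbors classifier tuned by grid search with 3-fold cross validation, followed by a confusion matrix. The data is synthetic, and the parameter grid is an assumption rather than the grid used in the linked notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the LSA and regional features.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=4, random_state=0
)

# Assumed parameter grid; 3-fold cross validation follows the text.
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}, cv=3, scoring="accuracy"
)
search.fit(X, y)

# Out-of-fold predictions of the best setting, summarized as a confusion matrix.
y_pred = cross_val_predict(search.best_estimator_, X, y, cv=3)
cm = confusion_matrix(y, y_pred)

print(search.best_params_, search.best_score_)
print(cm)
```

Rows of the matrix are the true categories and columns the predicted ones, so the diagonal holds the correctly classified samples.
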
## Value and perspectives
My analysis demonstrates that tweet data can be used to predict health indicators. Obesity and diabetes rates measured today are the results of unhealthy behavior in the past: diseases like obesity and diabetes only manifest after long-term unhealthy behavior. However, as my analysis indicates, future obesity rates can be monitored right now by analyzing the population's behavior using tweets.<br>
My analysis in its current state could be improved in many ways: with a greater number of tweets and a longer time period studied, the predictions can probably be refined. Once enough data is gathered (probably at least a decade's worth), a time series analysis should also be considered, because, as stated earlier, past behavior influences the current health indicators.<br>
However, the way I imagine this data science tool being used is to monitor more fine-grained behavior. For example, if the State of Mississippi rolled out a campaign to improve people's eating behavior, for instance by visiting schools, would there be a measurable effect in the tweets' content? Could a whole generation be impacted if influencers started to promote healthy eating behavior? These are questions that this data science tool could answer.
