# An Analysis of Temporal and Meteorological Effects on Twitter Sentiments Across the United States: Towards Harnessing Data Science Tools for Monitoring Mental Health Issues in a Population

## Preamble -- Using Data Science to Help Diagnose and Treat Mental Health Issues

Mental health issues have generated a significant social stir due to their excruciating human toll, the usual difficulty of their diagnosis, and the often-hidden socio-economic costs they inflict [1]. Social media provides rich, crowd-sourced data for a variety of analytical projects, and may afford the scientific community a useful vantage point for studying, diagnosing, and hopefully mitigating mental health issues in a population.

This project proposes to attempt the following:
    1. Mine social media data on Twitter for a few of the most densely populated areas of the USA.
    2. Gather concurrent weather data for the above locations.
    3. Analyse tweet sentiments to study the effects of location, weather, time, etc. 
    
While this is only a small project born out of the hobbies of a novice enthusiast in Data Science, it is hoped that this prototype will, in its own modest way, help raise awareness of and interest in the topic of mental health issues, and encourage the use of science & technology to help diagnose/solve real-life problems.

The motivation for this project to study social media sentiments and short-term environmental factors (weather, time of day, etc.) can be summed up by the following axioms: 
    1. We are products of our environment.
    2. We can make conscious choices regarding what we think and how we act.
    3. We can help make positive changes in the environment that shapes us.
    
In fact, axioms 2 and 3 above are often taken to be the basis of Cognitive-Behavioural Therapy (CBT) by the community of psychiatric practitioners [2]. Once a fuller scientific understanding of social media sentiments is established, it could be harnessed to deliver better public health interventions for mental health issues. For example, Seasonal Affective Disorder (SAD) has well-established weather-based triggers [3]. Weather forecasts could be used to predict periods of higher incidence, social media data could be used to identify its occurrence, and perhaps social media could also be used to offer support mechanisms for sufferers of SAD when their symptoms flare up.

Hopefully, the potential of the Data Revolution that is upon us can be utilised to alleviate, if not eliminate, mental health afflictions in our society.
    
[1] "The Neglect of Mental Illness Exacts a Huge Toll, Human and Economic", Scientific American (2012): https://www.scientificamerican.com/article/a-neglect-of-mental-illness/

[2] "Cognitive-Behavioural Therapy (CBT)", Centre for Addiction and Mental Health, University of Toronto (Undated): http://www.camh.ca/en/hospital/health_information/a_z_mental_health_and_addiction_information/CBT/Pages/default.aspx

[3] "Seasonal Affective Disorder (SAD)", The Mayo Clinic (Undated): http://www.mayoclinic.org/diseases-conditions/seasonal-affective-disorder/basics/definition/con-20021047


## Questions to be Addressed

    1. What are the different categories of weather to be found in the USA?
        * How does the geographical location of US cities correlate with weather data?
        * What factors does weather depend upon?
    2. How does the circadian rhythm affect the sentiment of tweets?
    3. Which regions are most likely to be positive/negative in their tweets?
    4. What are people tweeting about (and why)?

## Overview of Tools and Strategies -- Twitter and OpenWeatherMap, Sentiment Analysis via AFINN and NLTK, SciKit Learn for Analytics (k-Means Clusters, PCA, LDA/NMF Topic Models), and Flask for Presentation of Topics

Twitter data will be mined using the API provided by Twitter Inc. While geo-tagged tweets would have been ideal, they are rarer and significantly more difficult to mine than regular tweets. For an initial prototype, the user profile location shall be taken as a placeholder for the actual location.

Weather data will be obtained via the OpenWeatherMap API.

Sentiment Analysis of the tweets will be undertaken using AFINN and NLTK.
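
To give a feel for lexicon-based scoring, here is a minimal sketch of how an AFINN-style score can be computed: each known word carries an integer valence, and a tweet's score is the sum over its words. The five-word excerpt, the tokenising regular expression, and the names `load_afinn` and `afinn_score` are illustrative assumptions; this is not the code used in TwitterAcquisitionAnalysis.py.

```python
import re

# Hypothetical excerpt in the AFINN-111 layout: one "term<TAB>score" pair per line.
AFINN_SAMPLE = "good\t3\nbad\t-3\nhappy\t3\nsad\t-2\nterrible\t-3"

def load_afinn(text):
    """Parse tab-separated term/score lines into a dictionary."""
    scores = {}
    for line in text.splitlines():
        term, score = line.rsplit("\t", 1)
        scores[term] = int(score)
    return scores

def afinn_score(tweet, scores):
    """Sum the valence of every known word; unknown words count as 0."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return sum(scores.get(word, 0) for word in words)

scores = load_afinn(AFINN_SAMPLE)
print(afinn_score("Happy day but terrible traffic", scores))
```

A positive total suggests a positive tweet and a negative total a negative one; a simple sum like this ignores negation and emphasis, which is one reason to pair it with a second scorer.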

SciKit Learn will be used for general-purpose data analysis: k-Means Clustering, Principal Component Analysis (PCA), as well as Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorisation (NMF).

A small Flask application will also be written to facilitate the presentation of tweet topics by location.

## Structure of the Project -- Files and Directories

The root directory (https://github.com/NearIdentity/Sentiments_Twitter-Weather) contains the following Python script files:
    
* TwitterAcquisitionAnalysis.py 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/TwitterAcquisitionAnalysis.py)
        * Uses the Twitter API to gather data over 24 hours by city and integrates them with weather data based on an OpenWeatherMap wrapper contained in another module. 
        * Ascribes sentiment scores to each tweet based on AFINN and NLTK 
    
* OpenWeatherMapData.py 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/OpenWeatherMapData.py)
Provides a wrapper API around the OpenWeatherMap API for use by the Twitter data acquisition/integration module contained in TwitterAcquisitionAnalysis.py
    
* PostProcessing.py 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/PostProcessing.py)
This is where the bulk of the data analysis occurs. All code shown in this Jupyter notebook belongs to PostProcessing.py unless otherwise noted.
    
* AncilliaeHTML.py 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/AncilliaeHTML.py)
Provides helper functions for creating HTML output for topic model data. These are to be later used to build a small web app in Flask

* Analysis.py 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/Analysis.py)
Provides helper functions for data analysis using SciKit Learn (Elbow Method plots for k-Means, PCA, as well as topic models, etc.).

* AFINN-111.txt 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/AFINN-111.txt)
AFINN data for sentiment scores [Details available at: http://www2.imm.dtu.dk/pubdb/view/publication_details.php?id=6010]

* data/ 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/tree/master/data)
Directory containing tweet data text with file-names of the form [<city>_full__pos|neg|neu.txt]; files are generated by TwitterAcquisitionAnalysis.py. Also contains image files for data plots created by PostProcessing.py
 
* data/integrated_data_combined.csv 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/data/integrated_data_combined.csv)
Data file generated by TwitterAcquisitionAnalysis.py containing tweet sentiments and weather data 

* FlaskApp/ 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/tree/master/FlaskApp)
Directory containing the Flask app for presenting topic model data

* FlaskApp/FlaskApp.py 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/FlaskApp/FlaskApp.py)
Flask implementation of web app for displaying topic models and word clouds for different cities
            
* FlaskApp/TopicIndex.html 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/FlaskApp/TopicIndex.html)
Main HTML index page for the Flask app; provides links to the topic models by city.
            
* FlaskApp/pages/ 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/tree/master/FlaskApp/pages)
Directory containing the HTML pages to be used by the Flask app
            
* FlaskApp/pages/images/ 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/tree/master/FlaskApp/pages/images) 
Directory containing images of word clouds created for the HTML files
                
* pos.txt 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/pos.txt)
Integrated data containing positive-sentiment tweets only from the cities polled

* neg.txt 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/neg.txt)
Integrated data containing negative-sentiment tweets only from the cities polled

* neu.txt 
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/neu.txt)
Integrated data containing neutral-sentiment tweets only from the cities polled

* TwitterSentiments-Weather.ipynb
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/TwitterSentiments-Weather.ipynb)
This Jupyter notebook

* TwitterSentiments-Weather.html
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/TwitterSentiments-Weather.ipynb)
HTML page based on this Jupyter notebook 

* TwitterSentiments-Weather_files/
(https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/TwitterSentiments-Weather_files)
Ancillary files for HTML page based on this Jupyter notebook

## Post-Processing of Twitter+Weather Data (PostProcessing.py)

We have an integrated data file (https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/data/integrated_data_combined.csv) which contains tweet sentiment and local weather data. The file contains 41,000+ data-points collected over the span of about 24 hours from tweets sourced from the Twitter API and weather data from the OpenWeatherMap API. The data file was generated using TwitterAcquisitionAnalysis.py (https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/TwitterAcquisitionAnalysis.py) and we post-process said data using the script PostProcessing.py (https://github.com/NearIdentity/Sentiments_Twitter-Weather/blob/master/PostProcessing.py), which also serves as the basis of the present Jupyter notebook (with exceptions noted inline). 

### Importing Necessary Modules and Data

Let us start off by getting the necessary modules and data...

In [1]:
import pandas as pd 
import numpy as np
from sklearn.preprocessing import StandardScaler
from os import getcwd, path, mkdir
from Analysis import *
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from mpl_toolkits.basemap import Basemap

dataframe_full = pd.read_csv(getcwd()+"/data/integrated_data_combined.csv")
list_headers = list(dataframe_full)
array_coordinates = dataframe_full.values[:,0:2]
array_sentiments = dataframe_full.values[:,3:8]
array_weather0 = np.array(dataframe_full.values[:,8:13], dtype=float)
set_weather1 = set(dataframe_full.values[:,13])
array_weather1 = np.zeros((dataframe_full.values.shape[0], len(set_weather1)))
array_weather2 = np.array(dataframe_full.values[:,14:17], dtype=float)
array_weather3 = np.array(dataframe_full.values[:,18:19], dtype=float)
array_weather4 = np.array(dataframe_full.values[:,19:21], dtype=float)
array_day_night = np.zeros((dataframe_full.values.shape[0],2))
array_phase24h = np.empty((dataframe_full.values.shape[0],1))

list_headers_coordinates = list_headers[0:2]
list_headers_sentiments = list_headers[3:8]
list_headers_weather0 = list_headers[8:13]
list_headers_weather1 = [list_headers[13]]
list_headers_weather2 = list_headers[14:17]
list_headers_weather3 = list_headers[18:19]
list_headers_weather4 = list_headers[19:21]
list_headers_day_night = ["day(_)", "night(_)"]

### k-Means Clustering of Location (Elbow Method and Plots)

Let us start off with analysing coordinates of the locations. This data was "hard-coded" into the .csv file while reading tweets. There are many repetitions of the same coordinates, but there should only be a few unique ones.

In [2]:
'''
=======================================================
	Pre-Processing of Location Coordinates
=======================================================
'''

list_coordinates = [str(coord[0])+' '+str(coord[1]) for coord in array_coordinates]
set_coordinates = set(list_coordinates)
array_coordinates_unique = np.array([[float(unq_coord.split()[0]),float(unq_coord.split()[1])] for unq_coord in set_coordinates])
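
As an aside, the same de-duplication can be sketched with NumPy alone, assuming the coordinate columns are first cast to a float array (the slice of `dataframe_full.values` may have object dtype). The sample coordinates below are hypothetical.

```python
import numpy as np

# Hypothetical (latitude, longitude) rows with repetitions, as in the .csv file.
coords = np.array([
    [40.71, -74.01],
    [34.05, -118.24],
    [40.71, -74.01],
    [41.88, -87.63],
    [34.05, -118.24],
], dtype=float)

# axis=0 treats each row as one item, so duplicate rows collapse to one.
unique_coords = np.unique(coords, axis=0)
print(unique_coords)
```

Unlike the set-based approach, `np.unique` returns the rows in sorted order, so the output is reproducible between runs.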

Let us try to use k-Means Clustering of the location coordinates with some plotting on a geographical map. We will use the Elbow Method to choose a good k-value. It turns out, the best k-value to use will be 3.
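
The helper `kMeans_elbow_method_plot` lives in Analysis.py and is not reproduced in this notebook. As a rough sketch of the idea behind it, assuming it relies on scikit-learn's `KMeans` and its `inertia_` attribute, one can fit a model for each candidate k and record the within-cluster sum of squares. The function name `elbow_inertias` and the synthetic points are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(points, k_max):
    """Return the k-Means inertia for k = 1 .. k_max."""
    inertias = []
    for k in range(1, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
        inertias.append(model.inertia_)
    return inertias

# Three well-separated synthetic blobs (not the project's data).
rng = np.random.RandomState(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

inertias = elbow_inertias(points, 6)
for k, inertia in enumerate(inertias, start=1):
    print(k, round(inertia, 1))
```

Plotting these values against k, the "elbow" is the point after which adding another cluster stops reducing the inertia by much; for the synthetic blobs above, that point is expected at k=3.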

In [3]:
'''
==================================================
	Cluster Analysis of Locations
==================================================
'''

kMeans_elbow_method_plot(array_coordinates_unique, 20, getcwd()+"/data/Coordinates_kMeans_ElbowMethod.svg")

def create_background_map():
	plt.figure()
	background_map = Basemap(projection='stere', lat_0=+38.60, lon_0=-97.72, llcrnrlat=+22.00, urcrnrlat=+48.96, llcrnrlon=-122.68, urcrnrlon=-60.28, rsphere=6371200., resolution='l', area_thresh=10000)
	background_map.drawcoastlines()
	background_map.drawstates()                  
	background_map.drawcountries()
	parallels = np.arange(0.,90,10.)
	background_map.drawparallels(parallels,labels=[1,0,0,0],fontsize=10)
	meridians = np.arange(180.,360.,10.)
	background_map.drawmeridians(meridians,labels=[0,0,0,1],fontsize=10)
	
	return background_map

scaler_coordinates = StandardScaler()
kBest_coordinates = 3
list_colours = ['r', 'g', 'b']
array_coordinates_scaled_unique = scaler_coordinates.fit_transform(array_coordinates_unique)
kMeans_coordinates = kMeans_model(array_coordinates_scaled_unique, kBest_coordinates)
array_kMeans_coordinate_labels = kMeans_coordinates[1]
array_kMeans_coordinate_cen = kMeans_coordinates[2]

array_kMeans_coordinate_colours = np.array([list_colours[label] for label in array_kMeans_coordinate_labels])

array_latitudes = array_coordinates_unique[:, 0]
array_longitudes = array_coordinates_unique[:, 1]
bckgr_map = create_background_map()
x, y = bckgr_map(array_longitudes, array_latitudes)
bckgr_map.scatter(x, y, color=array_kMeans_coordinate_colours, s=200.0, marker='o', alpha=0.7)
plt.savefig(getcwd()+"/data/LocationClusters.svg")

  b = ax.ishold()
    See the API Changes document (http://matplotlib.org/api/api_changes.html)
    for more details.
  ax.hold(b)


We will ignore the warning messages above. Let us see how plots of the Elbow Method and the location clusters turn out.

#### Elbow Method Plot for Optimal k-Means Clustering of Location Coordinates

![title](data/Coordinates_kMeans_ElbowMethod.svg)

The plot above seems to suggest an "elbow" at k=3.

#### Geographical Illustration of Location Clusters

![title](data/LocationClusters.svg)

We get a geographical distribution of locations that roughly seem to be as follows:
    * East Coast and Mid-West (red)
    * West Coast and the Rocky Mountains (blue)
    * South (green)

### Weather Data Pre-Processing

Let us engineer the features of the weather data. We deviate briefly from the script file PostProcessing.py to inspect what we have for the weather data; earlier, we separated the headers from the data values.

In [6]:
list_headers_weather0

['temp(C)', 'temp_max(C)', 'temp_min(C)', 'humidity(%)', 'pressure(hPa)']

In [7]:
array_weather0[0:10,:]

array([[   19.88,    23.  ,    17.  ,    64.  ,  1015.  ],
       [   16.68,    21.  ,    12.  ,    68.  ,  1015.  ],
       [   19.88,    23.  ,    17.  ,    64.  ,  1015.  ],
       [   14.72,    17.  ,    13.  ,    72.  ,  1018.  ],
       [   24.92,    28.  ,    22.  ,    34.  ,  1011.  ],
       [   16.68,    21.  ,    12.  ,    68.  ,  1015.  ],
       [   12.2 ,    14.  ,    10.  ,   100.  ,  1019.  ],
       [   24.37,    26.  ,    23.  ,    78.  ,  1014.  ],
       [   23.88,    25.  ,    23.  ,    83.  ,  1013.  ],
       [   25.37,    27.  ,    24.  ,    83.  ,  1012.  ]])

All right. This is run-of-the-mill numerical data. Nothing out of the ordinary here.

In [8]:
list_headers_weather1

['sky(__)']

This will be a description of the sky conditions (in words) and will need to be converted into one-hot vectors. Let us see what kind of values we are dealing with...

In [10]:
set_weather1

{'Clear', 'Clouds', 'Drizzle', 'Fog', 'Haze', 'Mist', 'Rain', 'Thunderstorm'}

Now, back to the script, PostProcessing.py. Let us convert the sky descriptions, which map onto this eight-element set, into one-hot vectors.

In [11]:
'''
=============================================
	Weather Data Pre-Processing	
=============================================
'''
# One-Hot Vectors from Weather Descriptions

dict_weather1 = {}
i_weather1 = 0
for desc_weather1 in set_weather1:
	dict_weather1[desc_weather1] = i_weather1
	i_weather1 += 1

i_weather1 = 0
for weather1 in dataframe_full.values[:,13]:
	array_weather1[i_weather1, dict_weather1[weather1]] = 1
	i_weather1 += 1
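
A compact variant of the same encoding is sketched below. It sorts the categories first, so that the column order does not depend on the iteration order of a set, and fills the array with one indexed assignment. The `sky` values are a hypothetical sample.

```python
import numpy as np

# Hypothetical sample of sky descriptions.
sky = np.array(["Clear", "Rain", "Clouds", "Clear"])

# Sorting fixes the column order: Clear -> 0, Clouds -> 1, Rain -> 2.
categories = sorted(set(sky))
column_of = {name: i for i, name in enumerate(categories)}

one_hot = np.zeros((len(sky), len(categories)))
# Row r gets a 1 in the column of its own category.
one_hot[np.arange(len(sky)), [column_of[s] for s in sky]] = 1.0
print(categories)
print(one_hot)
```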

### Phase Angle Representation for Time of the Day -- Precursor for Studying the Impact of Circadian Rhythm on Tweet Sentiment Scores

We would like to study the effect of circadian rhythm on the tweet sentiments. Now, the raw time-stamp of a tweet is not a good indicator of what phase of the day we are in, as the number of daylight/night hours varies throughout the United States. We circumvent this issue by mapping the time-stamp of each tweet onto what we will refer to as the "journal phase", $\delta$. The daytime values of this are designed to lie between 0 (at sunrise) and $+\pi$ (at sunset). At sunset, the journal phase jumps from $+\pi$ to $-\pi$, causing overnight values to lie between $-\pi$ (at sunset) and 0 (at sunrise).

We also convert the time-stamp of every tweet into a day/night one-hot vector for easier analysis. We treat these as parts of the weather conditions.
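
The mapping just described can be written as a small stand-alone function. This is a sketch under the assumptions that all times are hours on a 24-hour clock and that sunrise precedes sunset within the same clock day; the name `journal_phase` is hypothetical and the function is not part of PostProcessing.py.

```python
import math

def journal_phase(time_h, sunrise_h, sunset_h):
    """Map a clock time onto the journal phase in [-pi, +pi]."""
    hours_day = sunset_h - sunrise_h
    hours_night = 24.0 - hours_day
    if sunrise_h <= time_h <= sunset_h:
        # Day: 0 at sunrise, rising linearly to +pi at sunset.
        return math.pi * (time_h - sunrise_h) / hours_day
    # Night: hours elapsed since sunset, wrapping past midnight.
    since_sunset = (time_h - sunset_h) % 24.0
    return math.pi * (-1.0 + since_sunset / hours_night)

# Sunrise at 06:00 and sunset at 20:00.
for hour in (6.0, 13.0, 20.0, 22.0, 2.0):
    print(hour, journal_phase(hour, 6.0, 20.0))
```

With these inputs, 13:00 is the middle of the day and maps to $+\pi/2$, while 02:00 is six hours into a ten-hour night and maps to $-0.4\pi$.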

In [12]:
# Day/Night One-Hot Vectors and Phase Angle

array_time = dataframe_full.values[:,2]
array_sunrise = dataframe_full.values[:,14]
array_sunset = dataframe_full.values[:,15]
for i_day_night in range(dataframe_full.values.shape[0]):
	hours_day = array_sunset[i_day_night] - array_sunrise[i_day_night]
	hours_night = 24.0 - hours_day
	if (array_time[i_day_night] == array_sunrise[i_day_night]):	# Sunrise
		array_phase24h[i_day_night,0] = 0
	elif (array_time[i_day_night] == array_sunset[i_day_night]):	# Sunset
		array_phase24h[i_day_night,0] = np.pi
	elif (array_sunrise[i_day_night] < array_time[i_day_night]) and (array_time[i_day_night] < array_sunset[i_day_night]):	# Day
		array_day_night[i_day_night,0] = 1.0
		array_phase24h[i_day_night,0] = np.pi * (array_time[i_day_night] - array_sunrise[i_day_night])/hours_day
	else:	# Night
		array_day_night[i_day_night,1] = 1.0
		if array_time[i_day_night] < array_sunset[i_day_night]: # Past midnight
			array_time[i_day_night] += 24.0
		array_phase24h[i_day_night,0] = np.pi * (-1 + (array_time[i_day_night] - array_sunset[i_day_night])/hours_night)
		if array_time[i_day_night] >= 24.0: # Past midnight
			array_time[i_day_night] -= 24.0

### Collating Weather Data into a Single NumPy Array of Input

We apply a data scaler from SciKit Learn to the weather data in preparation for cluster analysis. However, we do not apply the scaler on the one-hot-vectorised data. 

We synthesise the resultant scaled data into a NumPy array for cluster analysis of weather data.

In [None]:
# Scaled Combination of Weather Data for Cluster Analysis

scaler_weather0 = StandardScaler()
array_weather0 = scaler_weather0.fit_transform(array_weather0)
scaler_weather2 = StandardScaler()
array_weather2 = scaler_weather2.fit_transform(array_weather2)
scaler_weather3 = StandardScaler()
array_weather3 = scaler_weather3.fit_transform(array_weather3)
#scaler_phase24h = StandardScaler()
#array_phase24h = scaler_phase24h.fit_transform(array_phase24h) 

array_weather = np.empty((dataframe_full.shape[0], array_weather0.shape[1] + array_weather1.shape[1] + array_weather2.shape[1] + array_weather3.shape[1] + 2))

array_weather[:,0:array_weather0.shape[1]] = array_weather0
array_weather[:,array_weather0.shape[1]:array_weather0.shape[1]+array_weather1.shape[1]] = array_weather1
array_weather[:,array_weather0.shape[1]+array_weather1.shape[1]:array_weather0.shape[1]+array_weather1.shape[1]+array_weather2.shape[1]] = array_weather2
array_weather[:,array_weather0.shape[1]+array_weather1.shape[1]+array_weather2.shape[1]:array_weather0.shape[1]+array_weather1.shape[1]+array_weather2.shape[1]+array_weather3.shape[1]] = array_weather3
array_weather[:,array_weather0.shape[1]+array_weather1.shape[1]+array_weather2.shape[1]+array_weather3.shape[1]:array_weather0.shape[1]+array_weather1.shape[1]+array_weather2.shape[1]+array_weather3.shape[1]+2] = array_day_night
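
The slice-by-slice assignment above amounts to placing the blocks side by side. A shorter way to express the same collation is sketched below with `np.hstack`; the small arrays are hypothetical stand-ins for the scaled numeric columns, the one-hot sky columns, and the day/night columns.

```python
import numpy as np

n_rows = 4
# Stand-ins for the real blocks: five scaled numeric columns, three one-hot
# sky columns, and two day/night columns (all values hypothetical).
scaled_numeric = np.random.RandomState(0).normal(size=(n_rows, 5))
one_hot_sky = np.eye(3)[[0, 2, 1, 0]]
day_night = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)

# Column-wise concatenation; every block must have the same number of rows.
combined = np.hstack([scaled_numeric, one_hot_sky, day_night])
print(combined.shape)
```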

### Cluster Analysis of Weather Data

We also use the column headings for the weather data to arrive at "explained" weather clusters, which help us acquire some intuition on what different classes of weather data we have acquired.

Using the Elbow Method, we determine that we need four clusters for k-Means Clustering of the weather data.
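
Because clustering runs on standardised data, the cluster centres come back in scaled units and have to be mapped to physical units before they can be read. The sketch below shows that round trip by hand, using the same population mean and standard deviation that `StandardScaler` computes; the readings and the stand-in "centre" are hypothetical.

```python
import numpy as np

# Hypothetical readings: columns are temp(C) and humidity(%).
headers = ["temp(C)", "humidity(%)"]
raw = np.array([[28.0, 65.0], [18.0, 72.0], [24.0, 47.0], [20.0, 70.0]])

# Standardise each column to zero mean and unit variance.
mean, std = raw.mean(axis=0), raw.std(axis=0)
scaled = (raw - mean) / std

# Stand-in for a cluster centre returned in scaled units.
centre_scaled = scaled[[0, 2]].mean(axis=0)

# Undo the scaling so the centre reads in degrees Celsius and percent.
centre_raw = centre_scaled * std + mean
explained = list(zip(headers, centre_raw))
print(explained)
```

Pairing each recovered value with its column heading gives the kind of "explained" centre that the next cell prints for each cluster.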

In [13]:
'''
=================================================
	Cluster Analysis of Weather Data
=================================================
'''

kMeans_elbow_method_plot(array_weather, 20, getcwd()+"/data/Weather_kMeans_ElbowMethod.svg")

kBest_weather = 4

kMeans_weather = kMeans_model(array_weather, kBest_weather)
array_weather_cen0 = scaler_weather0.inverse_transform(kMeans_weather[2][:,0:array_weather0.shape[1]])
array_weather_cen1 = kMeans_weather[2][:,array_weather0.shape[1]:array_weather0.shape[1]+array_weather1.shape[1]]
array_weather_cen2 = scaler_weather2.inverse_transform(kMeans_weather[2][:,array_weather0.shape[1]+array_weather1.shape[1]:array_weather0.shape[1]+array_weather1.shape[1]+array_weather2.shape[1]])
array_weather_cen3 = scaler_weather3.inverse_transform(kMeans_weather[2][:,array_weather0.shape[1]+array_weather1.shape[1]+array_weather2.shape[1]:array_weather0.shape[1]+array_weather1.shape[1]+array_weather2.shape[1]+array_weather3.shape[1]])

list_expl_weather0 = []
for cen in array_weather_cen0:
	weather0_expl = [ (list_headers_weather0[i_w0], cen[i_w0]) for i_w0 in range(len(list_headers_weather0)) ]
	print(weather0_expl)
	list_expl_weather0.append(weather0_expl)

list_expl_weather1 = []
for cen in array_weather_cen1:
	weather1_expl = sorted([ (desc, cen[dict_weather1[desc]]) for desc in dict_weather1.keys() ], key=lambda x: x[1], reverse=True)
	print(weather1_expl)
	list_expl_weather1.append(weather1_expl)

list_expl_weather2 = []
for cen in array_weather_cen2:
	weather2_expl = [ (list_headers_weather2[i_w2], cen[i_w2]) for i_w2 in range(len(list_headers_weather2)) ]
	print(weather2_expl)
	list_expl_weather2.append(weather2_expl)

list_expl_weather3 = []
for cen in array_weather_cen3:
	weather3_expl = [ (list_headers_weather3[i_w3], cen[i_w3]) for i_w3 in range(len(list_headers_weather3)) ]
	print(weather3_expl)
	list_expl_weather3.append(weather3_expl)

[('temp(C)', 28.358161753357237), ('temp_max(C)', 30.092066676956406), ('temp_min(C)', 26.536656891495497), ('humidity(%)', 64.912872356845256), ('pressure(hPa)', 1011.640762463343)]
[('temp(C)', 17.652799069561361), ('temp_max(C)', 20.24767390341167), ('temp_min(C)', 14.713225520602567), ('humidity(%)', 71.591603898981461), ('pressure(hPa)', 1017.552503322995)]
[('temp(C)', 19.413540181997501), ('temp_max(C)', 21.428023243065567), ('temp_min(C)', 17.47703102729978), ('humidity(%)', 72.516719657932867), ('pressure(hPa)', 1013.1904396447758)]
[('temp(C)', 23.917136056364882), ('temp_max(C)', 25.436637151290117), ('temp_min(C)', 22.161096829477309), ('humidity(%)', 47.22088926973268), ('pressure(hPa)', 1015.8051033038178)]
[('Clouds', 0.42483407933322814), ('Clear', 0.29587899367188686), ('Rain', 0.11236301898439166), ('Mist', 0.0603488192622329), ('Thunderstorm', 0.060040129649647056), ('Haze', 0.03897206359006563), ('Drizzle', 0.0075628955085656176), ('Fog', 6.6570013390609972e-17)]
[(

#### Inspection of Weather Data Clusters Based on Heading-Data Key-Value Pairs

Well, the text of the output above looks really messy. Let us see what we can derive out of that by deconstructing this manually...

---

[('temp(C)', 28.358161753357237), ('temp_max(C)', 30.092066676956406), ('temp_min(C)', 26.536656891495497), ('humidity(%)', 64.912872356845256), ('pressure(hPa)', 1011.640762463343)]      

    ^ Hot & moderately humid ^

[('temp(C)', 17.652799069561361), ('temp_max(C)', 20.24767390341167), ('temp_min(C)', 14.713225520602567), ('humidity(%)', 71.591603898981461), ('pressure(hPa)', 1017.552503322995)]      
    
    ^ Cool & humid ^
    
[('temp(C)', 19.413540181997501), ('temp_max(C)', 21.428023243065567), ('temp_min(C)', 17.47703102729978), ('humidity(%)', 72.516719657932867), ('pressure(hPa)', 1013.1904396447758)]     
    
    ^ Cool & humid ^

[('temp(C)', 23.917136056364882), ('temp_max(C)', 25.436637151290117), ('temp_min(C)', 22.161096829477309), ('humidity(%)', 47.22088926973268), ('pressure(hPa)', 1015.8051033038178)]      

    ^ Temperate & somewhat humid ^

---

[('Clouds', 0.42483407933322814), ('Clear', 0.29587899367188686), ('Rain', 0.11236301898439166), ('Mist', 0.0603488192622329), ('Thunderstorm', 0.060040129649647056), ('Haze', 0.03897206359006563), ('Drizzle', 0.0075628955085656176), ('Fog', 6.6570013390609972e-17)]              

    ^ Cloudy mixed with clear; some rain ^
    
[('Clear', 0.63181214000880948), ('Clouds', 0.23704031900751887), ('Rain', 0.054607886575096361), ('Mist', 0.051063358440408574), ('Drizzle', 0.012627381479839987), ('Haze', 0.0057598582188803787), ('Fog', 0.0056490917146652985), ('Thunderstorm', 0.0014399645547154734)]      

    ^ Clear with some clouds ^

[('Clouds', 0.41322223440413708), ('Haze', 0.25249424405212034), ('Mist', 0.21565617805065346), ('Clear', 0.078938712860399995), ('Rain', 0.036838066001531999), ('Thunderstorm', 0.0026312904286784067), ('Drizzle', 0.00021927420238958967), ('Fog', 6.4184768611141862e-17)]             

    ^ Cloudy and hazy/misty ^

[('Clouds', 0.62715414643433376), ('Clear', 0.33028658478531381), ('Rain', 0.021041607159868025), ('Haze', 0.0086641911834782134), ('Thunderstorm', 0.0059982862039382051), ('Mist', 0.0041892792535477008), ('Drizzle', 0.0026659049795291208), ('Fog', 6.5268970783627367e-17)]              

    ^ Cloudy mixed with clear; almost no rain ^

---

[('sunrise(h)', 8.2226783968729347), ('sunset(h)', 21.231486340483009), ('wind(m/s)', 3.1636834388022761)]
[('sunrise(h)', 6.7295894254917332), ('sunset(h)', 20.08421577314267), ('wind(m/s)', 2.7532853345148331)]
[('sunrise(h)', 9.327542667106151), ('sunset(h)', 22.581365347364482), ('wind(m/s)', 2.6494967657054977)]
[('sunrise(h)', 6.6556142689400941), ('sunset(h)', 20.001413881738397), ('wind(m/s)', 3.2673674188326993)]

---

[('cloudiness(%)', 39.769563204197603)]       <--- Mostly clear with clouds

[('cloudiness(%)', 18.831745680103875)]       <--- Mostly clear

[('cloudiness(%)', 64.392829733579902)]       <--- Mostly cloudy with some clearness

[('cloudiness(%)', 34.984099781014478)]       <--- Mostly clear with clouds

--- 


Based on the manual observations above, we classify weather data into the following four categories:
    * Type 0: hot, quite humid, mix of clear & cloudy; some rain
    * Type 1: cool, humid, mostly clear
    * Type 2: cool, humid, cloudy/misty/hazy
    * Type 3: temperate, slightly humid, mix of clear & cloudy, ~0 rain

#### Elbow Method Plot for Optimal k-Means Clustering of Weather Data

Here is the Elbow Method plot justifying the four-fold classification of weather types:

![title](data/Weather_kMeans_ElbowMethod.svg)

### Tweet Sentiments and Circadian Rhythm

It is now time to start synthesising the tweet data with the weather data. We start off by looking at the journal phase and the tweet sentiments. 

We note that the Twitter dataset was gathered between August 24th and 25th, 2017, both of them weekdays. 

As a preliminary form of analysis, let us plot mean and standard deviations of sentiment scores as functions of the journal phase. We use a 100-point sample of the journal phase, each point effectively corresponding to a little less than fifteen minutes of Twitter data (24 hours in the day divided into 100 samples).
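
Each of the 100 bins spans $2\pi/100$ radians of journal phase, or roughly $24 \times 60 / 100 = 14.4$ minutes of clock time. The per-bin statistics can be sketched in vectorised form as follows. The name `binned_mean_std` is hypothetical, and the sketch assigns bins with `floor` rather than the `ceil(...) - 1` used in the script, so the two differ only for values lying exactly on a bin edge.

```python
import numpy as np

def binned_mean_std(phase, values, num_bins=100):
    """Mean and population standard deviation of values per phase bin."""
    delta = 2 * np.pi / num_bins
    index = np.floor((phase + np.pi) / delta).astype(int)
    index = np.clip(index, 0, num_bins - 1)  # phase == +pi goes to the last bin
    counts = np.bincount(index, minlength=num_bins)
    sums = np.bincount(index, weights=values, minlength=num_bins)
    squares = np.bincount(index, weights=values ** 2, minlength=num_bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = sums / counts  # empty bins become NaN
        std = np.sqrt(np.maximum(squares / counts - mean ** 2, 0.0))
    return mean, std

# Hypothetical scores: two shortly after sunset, one just before it.
phase = np.array([-np.pi + 0.01, -np.pi + 0.02, np.pi - 0.01])
values = np.array([1.0, 3.0, 5.0])
mean, std = binned_mean_std(phase, values, num_bins=4)
print(mean, std)
```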

In [14]:
'''
================================================================
	Analysis: Tweet Sentiments vs. Circadian Rhythm
================================================================
'''

num_divs_phase24h = 100
delta_phase24h = 2*np.pi/num_divs_phase24h
array_index_phase24h = np.array(np.ceil((array_phase24h[:,0] + np.pi)/delta_phase24h)-1, dtype=int)

nested_list_AFINN = []
nested_list_comp = []
nested_list_neg = []
nested_list_neu = []
nested_list_pos = []

array_AFINN = array_sentiments[:,0]
array_comp = array_sentiments[:,1]
array_neg = array_sentiments[:,2]
array_neu = array_sentiments[:,3]
array_pos = array_sentiments[:,4]                                            

for i_phase24h in range(num_divs_phase24h):                                         
	nested_list_AFINN.append([])             
	nested_list_comp.append([])                              
	nested_list_neg.append([])                             
	nested_list_neu.append([])                           
	nested_list_pos.append([]) 

for i_data in range(dataframe_full.values.shape[0]):                                
	i_phase24h = array_index_phase24h[i_data]
	nested_list_AFINN[i_phase24h].append(array_AFINN[i_data])
	nested_list_comp[i_phase24h].append(array_comp[i_data])
	nested_list_neg[i_phase24h].append(array_neg[i_data])
	nested_list_neu[i_phase24h].append(array_neu[i_data])
	nested_list_pos[i_phase24h].append(array_pos[i_data])

array_sample_phase24h = np.arange(-np.pi+0.5*delta_phase24h, +np.pi, delta_phase24h)

list_avg_AFINN = []
list_avg_comp = []
list_avg_neg = []
list_avg_neu = []
list_avg_pos = []

list_stdev_AFINN = []
list_stdev_comp = []
list_stdev_neg = []
list_stdev_neu = []
list_stdev_pos = []

for i_phase24h in range(num_divs_phase24h):
	list_avg_AFINN.append(np.mean(nested_list_AFINN[i_phase24h]))
	list_avg_comp.append(np.mean(nested_list_comp[i_phase24h]))
	list_avg_neg.append(np.mean(nested_list_neg[i_phase24h]))
	list_avg_neu.append(np.mean(nested_list_neu[i_phase24h]))
	list_avg_pos.append(np.mean(nested_list_pos[i_phase24h]))
	list_stdev_AFINN.append(np.std(nested_list_AFINN[i_phase24h]))
	list_stdev_comp.append(np.std(nested_list_comp[i_phase24h]))
	list_stdev_neg.append(np.std(nested_list_neg[i_phase24h]))
	list_stdev_neu.append(np.std(nested_list_neu[i_phase24h]))
	list_stdev_pos.append(np.std(nested_list_pos[i_phase24h]))
	
phase24h_avg_fgr = plt.figure()
phase24h_avg_fgr.clf()
phase24h_avg_plt = phase24h_avg_fgr.add_subplot(1,1,1)
phase24h_avg_plt.plot(array_sample_phase24h, np.array(list_avg_AFINN), linestyle='-', marker='o', color='c', label="AFINN")
phase24h_avg_plt.plot(array_sample_phase24h, np.array(list_avg_comp), linestyle='-', marker='o', color='g', label="comp")
phase24h_avg_plt.plot(array_sample_phase24h, np.array(list_avg_neg), linestyle='-', marker='o', color='b', label="neg")
phase24h_avg_plt.plot(array_sample_phase24h, np.array(list_avg_neu), linestyle='-', marker='o', color='k', label="neu")
phase24h_avg_plt.plot(array_sample_phase24h, np.array(list_avg_pos), linestyle='-', marker='o', color='y', label="pos")
phase24h_avg_plt.set_xlabel(r"Journal Phase, $\delta$ [radians]")
phase24h_avg_plt.set_ylabel(r"Average Sentiment Score, $\overline{\Sigma}$ [dimensionless]")
phase24h_avg_plt.legend()
phase24h_avg_fgr.savefig(getcwd()+"/data/AverageSentiment_CircadianRhythm.svg")

phase24h_stdev_fgr = plt.figure()
phase24h_stdev_fgr.clf()
phase24h_stdev_plt = phase24h_stdev_fgr.add_subplot(1,1,1)
phase24h_stdev_plt.plot(array_sample_phase24h, np.array(list_stdev_AFINN), linestyle='-', marker='o', color='c', label="AFINN")
phase24h_stdev_plt.plot(array_sample_phase24h, np.array(list_stdev_comp), linestyle='-', marker='o', color='g', label="comp")
phase24h_stdev_plt.plot(array_sample_phase24h, np.array(list_stdev_neg), linestyle='-', marker='o', color='b', label="neg")
phase24h_stdev_plt.plot(array_sample_phase24h, np.array(list_stdev_neu), linestyle='-', marker='o', color='k', label="neu")
phase24h_stdev_plt.plot(array_sample_phase24h, np.array(list_stdev_pos), linestyle='-', marker='o', color='y', label="pos")
phase24h_stdev_plt.set_xlabel("Journal Phase, $\delta$ [radians]")
phase24h_stdev_plt.set_ylabel("Standard Deviation of Sentiment Score, $\sigma_{\Sigma}$ [dimensionless]")
phase24h_stdev_plt.legend()
phase24h_stdev_fgr.savefig(getcwd()+"/data/StandardDeviationSentiment_CircadianRhythm.svg")

So, how did the results turn out?

#### Raw Data for Mean Sentiment Scores (Over Respective Sampling Time-Period) vs. Journal Phase

![title](data/AverageSentiment_CircadianRhythm.svg)

#### Raw Data for Standard Deviation of Sentiment Scores (Over Respective Sampling Time-Period) vs. Journal Phase

![title](data/StandardDeviationSentiment_CircadianRhythm.svg)

A few notes: 
    * The AFINN plot is based on the AFINN sentiment score.
    * The other plots are NLTK sentiment scores: compound (comp), negative (neg), neutral (neu), and positive (pos).

The data seems really noisy. =(

Let us try a smoothing filter (nothing fancy: just a moving average over each point and its three nearest neighbours on either side, wrapping around the ends of the phase range) to reduce some of the noise in the average sentiment score. [We could also try using a smaller number of samples for the journal phase, $\delta$.]

In [15]:
def average_smoothing(input_list, window_size=3):
	output_list = []
	for i_data in range(len(input_list)):
		window_array = np.array(range(i_data-window_size, i_data+window_size+1), dtype=int) % len(input_list)
		sum_window = 0.0
		for i_window in window_array:
			sum_window += input_list[i_window]
		output_list.append( sum_window / (2*window_size + 1) )
	return np.array(output_list, dtype=float)

phase24h_avg_fgr = plt.figure()
phase24h_avg_fgr.clf()
phase24h_avg_plt = phase24h_avg_fgr.add_subplot(1,1,1)
phase24h_avg_plt.plot(array_sample_phase24h, average_smoothing(list_avg_AFINN), linestyle='-', marker='o', color='c', label="AFINN")
phase24h_avg_plt.plot(array_sample_phase24h, average_smoothing(list_avg_comp), linestyle='-', marker='o', color='g', label="comp")
phase24h_avg_plt.plot(array_sample_phase24h, average_smoothing(list_avg_neg), linestyle='-', marker='o', color='b', label="neg")
phase24h_avg_plt.plot(array_sample_phase24h, average_smoothing(list_avg_neu), linestyle='-', marker='o', color='k', label="neu")
phase24h_avg_plt.plot(array_sample_phase24h, average_smoothing(list_avg_pos), linestyle='-', marker='o', color='y', label="pos")
phase24h_avg_plt.set_xlabel("Journal Phase, $\delta$ [radians]")
phase24h_avg_plt.set_ylabel("Smoothed Average Sentiment Score, $\overline{\Sigma}$ [dimensionless]")
phase24h_avg_plt.legend()
phase24h_avg_fgr.savefig(getcwd()+"/data/AverageSentimentSmoothed_CircadianRhythm.svg")
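
For reference, the same wrap-around averaging can be written without the explicit index arithmetic. The following is only a minimal sketch, assuming NumPy is available; the name `circular_moving_average` and its `half_width` parameter are hypothetical and are not part of the notebook. It should give the same values as `average_smoothing` above for the same window size.

```python
import numpy as np

def circular_moving_average(values, half_width=3):
    """Average each sample with its `half_width` neighbours on either side,
    wrapping around the ends because the phase axis is periodic."""
    values = np.asarray(values, dtype=float)
    total = np.zeros_like(values)
    # Adding shifted copies of the array covers the same 2*half_width + 1
    # samples as a loop over the window with modular indexing.
    for shift in range(-half_width, half_width + 1):
        total += np.roll(values, shift)
    return total / (2 * half_width + 1)

# Toy input: a single spike is spread evenly over its neighbours.
smoothed = circular_moving_average([0, 0, 0, 7, 0, 0, 0, 0], half_width=1)
print(smoothed)
```

Because every sample contributes to exactly `2*half_width + 1` windows, the filter leaves the overall mean of the series unchanged; it only redistributes the noise.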

So, how do we fare? 

#### Smoothed Data for Mean Sentiment Scores (Over Respective Sampling Time-Period) vs. Journal Phase

![title](data/AverageSentimentSmoothed_CircadianRhythm.svg)

This seems slightly better. We seem to be seeing a decent progression for the AFINN data, but we are not seeing much from the NLTK scores. Let us plot the AFINN and NLTK compound scores separately.

In [16]:
phase24h_avg_fgr = plt.figure()
phase24h_avg_fgr.clf()
phase24h_avg_plt = phase24h_avg_fgr.add_subplot(1,1,1)
phase24h_avg_plt.plot(array_sample_phase24h, average_smoothing(list_avg_AFINN), linestyle='-', marker='o', color='c', label="AFINN")
phase24h_avg_plt.set_xlabel("Journal Phase, $\delta$ [radians]")
phase24h_avg_plt.set_ylabel("Smoothed Average Sentiment Score, $\overline{\Sigma}$ [dimensionless]")
phase24h_avg_plt.legend()
phase24h_avg_fgr.savefig(getcwd()+"/data/AverageSentimentSmoothed_CircadianRhythm_AFINN.svg")

phase24h_avg_fgr = plt.figure()
phase24h_avg_fgr.clf()
phase24h_avg_plt = phase24h_avg_fgr.add_subplot(1,1,1)
phase24h_avg_plt.plot(array_sample_phase24h, average_smoothing(list_avg_comp), linestyle='-', marker='o', color='g', label="comp")
phase24h_avg_plt.set_xlabel("Journal Phase, $\delta$ [radians]")
phase24h_avg_plt.set_ylabel("Smoothed Average Sentiment Score, $\overline{\Sigma}$ [dimensionless]")
phase24h_avg_plt.legend()
phase24h_avg_fgr.savefig(getcwd()+"/data/AverageSentimentSmoothed_CircadianRhythm_compNLTK.svg")

#### Smoothed Data for Mean AFINN Sentiment Scores (Over Respective Sampling Time-Period) vs. Journal Phase

![title](data/AverageSentimentSmoothed_CircadianRhythm_AFINN.svg)

#### Smoothed Data for Mean Compound NLTK Sentiment Scores (Over Respective Sampling Time-Period) vs. Journal Phase

![title](data/AverageSentimentSmoothed_CircadianRhythm_compNLTK.svg)


It is reassuring to see that the NLTK compound scores follow the AFINN score trends. =)

Let us take a closer look at the data...

### The Mood-Changing Effects of Sunrise, Sunset, and Midnight

For $\delta\,\in\,\left(-\pi,\,-\frac{2}{3}\pi\right)$, there seems to be a spike in the tweet sentiment positivity. This is probably due to after-work relaxation tweeting on a weekday before bedtime. 

For $\delta\,\approx\,-\frac{1}{2}\pi$, the lowest point of the tweet sentiment scores occurs. We are tempted to ask: Are people more likely to be depressed in the middle of the night?

This is followed by a rise in positivity, then by another slump.

We also notice a rise in positivity of the tweet scores for $\delta\,\approx\,0$, i.e. near sunrise. This could be of interest from a circadian rhythm perspective. 

The tweet positivity levels fall as we progress further into the day, to pick up later (perhaps, after lunch-break and in anticipation of the end of the workday?). 

There seems to be a rise in positivity level, perhaps, at the end of the workday for most people.

Lastly, we note that tweet sentiments are at a low at sunset, i.e. for $\delta\,\approx\,\pm\pi$.

The data seems to suggest that circadian rhythm may play an important role in the sentiment of Twitter data text. Sunrise seems to be a time of positive sentiments, whereas sunset and the middle of the night seem to be times of negativity.

### Some Principal Component Analysis on the Weather Data

There is a very large set of weather data parameters, but it stands to reason that not all of these would be relevant. This is a good time to try a Principal Component Analysis to determine what the key players are in shaping the weather of the United States for the period under scrutiny.

Using our old technique of merging headers and data components, we will also try to come up with some explanation of what is the most significant amongst the 19 weather data parameters acquired.

In [17]:
'''
=========================================================
	Weather Data Principal Component Analysis
=========================================================
'''

pca_elbow_method_plot(array_weather, 19, getcwd()+"/data/PCA_WeatherData.svg")
weather_components_PCA = pca_model(array_weather, 5)[1] 
list_headers_weather = list_headers_weather0 + list(dict_weather1.keys()) + list_headers_weather2 + list_headers_weather3 + list_headers_day_night

weather_components_explained = []
for i_component in range(weather_components_PCA.shape[0]):
	component_vector = weather_components_PCA[i_component]
	# Rank the headers by the magnitude of their loading on this component, keeping the signed loading (avoids dividing by a zero loading to recover the sign)
	explained_component = sorted([(list_headers_weather[i_basis], component_vector[i_basis]) for i_basis in range(weather_components_PCA.shape[1])], key=lambda x: np.abs(x[1]), reverse=True)
	weather_components_explained.append(explained_component)
print weather_components_explained

[[('temp(C)', 0.5180331167630976), ('temp_max(C)', 0.51152783034572524), ('temp_min(C)', 0.50753338754022204), ('pressure(hPa)', -0.34315047073388766), ('humidity(%)', -0.23400017487206859), ('day(_)', 0.096083256417524229), ('night(_)', -0.096068499777244132), ('sunrise(h)', 0.08981705761092601), ('wind(m/s)', 0.088185697662805054), ('Clouds', 0.044348705784236116), ('sunset(h)', 0.036816038178388799), ('Clear', -0.034478424033643512), ('Mist', -0.02262809374845005), ('cloudiness(%)', 0.018393007915670841), ('Thunderstorm', 0.010034662349665865), ('Rain', 0.009191869494754595), ('Haze', -0.0054419375758603273), ('Fog', -0.0011773642586568572), ('Drizzle', 0.00015058198795230422)], [('sunrise(h)', 0.59352838441785971), ('sunset(h)', 0.58111428140381938), ('cloudiness(%)', 0.3295353838940176), ('humidity(%)', 0.29539681519763117), ('pressure(hPa)', -0.26597435073218273), ('Clear', -0.12555515318540653), ('wind(m/s)', -0.092434749373820596), ('temp(C)', -0.068330263721813833), ('temp_max

#### Inspection of Principal Components of Weather Based on Header-Data Key-Value Pairings

Let us try to sift through the mess of the output text by adding some extra whitespace at the necessary places...


---

[
[('temp(C)', 0.51803311676311825), ('temp_max(C)', 0.51152783034569516), ('temp_min(C)', 0.50753338754022181), ('pressure(hPa)', -0.34315047073390426), ('humidity(%)', -0.23400017487209315), ('day(_)', 0.096083256417510227), ('night(_)', -0.096068499777248698), ('sunrise(h)', 0.089817057610875786), ('wind(m/s)', 0.08818569766281921), ('Clouds', 0.044348705784233361), ('sunset(h)', 0.03681603817834616), ('Clear', -0.034478424033634401), ('Mist', -0.022628093748451261), ('cloudiness(%)', 0.018393007915639192), ('Thunderstorm', 0.010034662349665709), ('Rain', 0.0091918694947534917), ('Haze', -0.0054419375758630942), ('Fog', -0.0011773642586568346), ('Drizzle', 0.00015058198795306671)],      

    ^ "Perceived temperature" (humidex, wind-chill, etc) ^

[('sunrise(h)', 0.5935283844178727), ('sunset(h)', 0.5811142814037783), ('cloudiness(%)', 0.32953538389399872), ('humidity(%)', 0.29539681519764799), ('pressure(hPa)', -0.26597435073225795), ('Clear', -0.12555515318540794), ('wind(m/s)', -0.092434749373840247), ('temp(C)', -0.068330263721767925), ('temp_max(C)', -0.066143004169659458), ('Haze', 0.062164203545594335), ('Mist', 0.054635085893178149), ('temp_min(C)', -0.047861324099138611), ('night(_)', 0.021001059972511123), ('day(_)', -0.02098959030914066), ('Rain', 0.0070297485033139389), ('Thunderstorm', 0.0031987317196776658), ('Clouds', -0.0012105558479702135), ('Drizzle', -0.00015327253200441893), ('Fog', -0.00010878809637685213)],	   

    ^ "Light exposure & air/water interaction" axis ^

[('wind(m/s)', 0.65453314587331901), ('cloudiness(%)', 0.53415043208683632), ('pressure(hPa)', 0.29458711674661014), ('Clouds', 0.2741383660171155), ('Clear', -0.25823178574262967), ('humidity(%)', -0.14604680920516366), ('night(_)', -0.1198424476448092), ('day(_)', 0.11983495039391612), ('temp_max(C)', -0.055291715647165555), ('Haze', -0.042897337906536985), ('sunrise(h)', -0.03699799307240996), ('Rain', 0.025656543952825725), ('temp(C)', -0.022680156685964167), ('Mist', -0.0090216298984092669), ('Thunderstorm', 0.0067303112482915862), ('temp_min(C)', -0.0039555731116531767), ('Drizzle', 0.0034438376518846491), ('sunset(h)', -0.0026887522950110277), ('Fog', 0.00018169467747615602)],	  

    ^ "Cloud formation" axis 1 (wind and pressure) ^

[('cloudiness(%)', -0.53947147615500024), ('humidity(%)', -0.49868068946454747), ('sunset(h)', 0.39495756948909522), ('wind(m/s)', 0.30916677110888174), ('sunrise(h)', 0.30859833277013315), ('pressure(hPa)', 0.17738390420498362), ('Clear', 0.15627533060536469), ('night(_)', -0.11572145736716272), ('day(_)', 0.11569607789180718), ('temp_min(C)', -0.1068869387438596), ('temp(C)', -0.081589809815525896), ('Clouds', -0.070231182142093063), ('temp_max(C)', -0.065061252147040297), ('Mist', -0.045020281854660435), ('Rain', -0.038529584554613291), ('Drizzle', -0.0055913409802464059), ('Haze', 0.0047119898113903366), ('Fog', -0.0018798326377839857), ('Thunderstorm', 0.00026490175263674731)],	     

    ^ "Cloud formation" axis 2 (humidity, wind, and light) ^

[('wind(m/s)', 0.66126297057486971), ('humidity(%)', 0.5385866575350049), ('cloudiness(%)', -0.29471139439723354), ('pressure(hPa)', -0.27198932186246566), ('day(_)', -0.16037598460692196), ('night(_)', 0.16034245431034738), ('Clouds', -0.15454997233896151), ('Clear', 0.13635249031825808), ('sunset(h)', -0.07831470716211425), ('temp_max(C)', 0.071669323178983987), ('Haze', -0.055409026713919965), ('Mist', 0.033258939066420003), ('temp_min(C)', -0.033241439516068047), ('Thunderstorm', 0.024223857309966325), ('sunrise(h)', -0.020114043430359701), ('temp(C)', 0.013817555947356177), ('Rain', 0.013305485310842698), ('Drizzle', 0.001931801457181), ('Fog', 0.00088642559023209761)] 	

    ^ "Cloud formation" axis 3 (wind and humidity) ^

---

There seem to be five principal components of weather data based on an 85% cut-off for the explained variance in PCA:
    1. Perceived temperature: temperature and humidity
    2. Sunlight and air/water interaction: humidity, pressure, etc.
    3. Cloud formation:
        a. Wind and pressure
        b. Humidity, wind, and sunlight
        c. Wind and humidity
        
Lastly, we provide a plot for the explained variances for the PCA.

#### Explained Variance Plot for PCA of Weather Data

![title](data/PCA_WeatherData.svg)
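
To illustrate how a cut-off on the cumulative explained variance picks the number of components, and how the loadings can be ranked by magnitude, here is a minimal sketch on synthetic data. It is not the `pca_model` helper used above: the functions `pca_with_cutoff` and `rank_loadings`, the toy columns, and the choice to standardise each column are all assumptions made for illustration, and the decomposition is done with a plain NumPy SVD.

```python
import numpy as np

def pca_with_cutoff(data, variance_cutoff=0.85):
    """Return (components, explained_variance_ratio, n_kept), keeping the smallest
    number of components whose cumulative explained variance reaches the cut-off."""
    data = np.asarray(data, dtype=float)
    # Standardise each column (an assumption; the notebook's own scaling is not shown here).
    std = data.std(axis=0)
    std[std == 0] = 1.0
    scaled = (data - data.mean(axis=0)) / std
    # Rows of vt are the principal axes; squared singular values are proportional to variance.
    _, singular_values, vt = np.linalg.svd(scaled, full_matrices=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    n_kept = int(np.searchsorted(np.cumsum(ratio), variance_cutoff) + 1)
    n_kept = min(n_kept, len(ratio))
    return vt[:n_kept], ratio, n_kept

def rank_loadings(component, feature_names):
    """Sort (feature, signed loading) pairs by absolute loading, largest first."""
    order = np.argsort(-np.abs(component))
    return [(feature_names[i], float(component[i])) for i in order]

# Synthetic example: two nearly identical "temperature" columns and an independent "wind" column.
rng = np.random.RandomState(0)
temperature = rng.normal(size=500)
toy_weather = np.column_stack([temperature,
                               temperature + 0.05 * rng.normal(size=500),
                               rng.normal(size=500)])
components, ratio, n_kept = pca_with_cutoff(toy_weather, 0.85)
print(n_kept)
print(rank_loadings(components[0], ["temp", "temp_max", "wind"]))
```

In this toy case the two correlated temperature columns collapse onto one axis, much as `temp(C)`, `temp_max(C)`, and `temp_min(C)` dominate the first component in the output above.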

### Geographical Distribution of Weather and Negative-Sentiment Tweets

We propose to look at the proportion of negative-sentiment tweets as a function of geographical location and weather type. We tally up the negative-sentiment tweets for the various weather types identified above for every US city studied. We only consider the location and weather combinations for which more than 10 tweets have been recorded. 

The proportion of negative-sentiment tweets in a sample is represented by the colour of the scatter plot. The shape of the marker represents the regional classification (resultant from k-Means Clustering of location coordinates) and the size of the marker is representative of the sample-size of the tweets.

In [28]:
'''
=========================================================================================================
	Scatter Plots of Negative-Sentiment Tweets -- Geographical Distributions by Weather Type
=========================================================================================================
'''

# Dictionary Structure with City Data: Key = City Name, Value = (Latitude, Longitude)
dict_city_coordinate =	{"new york": (40.6643, -73.9385),
			"los angeles": (34.0194, -118.4108),
			"chicago": (41.8376, -87.6818),
			"houston": (29.7805, -95.3863),
			"phoenix": (33.5722, -112.0880),
			"philadelphia": (40.0094, -75.1333),
			"san antonio": (29.4724, -98.5251),
			"san diego": (32.8153, -117.1350),
			"dallas": (32.7757, -96.7967),
			"san Jose": (37.2969, -121.8193),
			"austin": (30.3072, -97.7560),
			"jacksonville": (30.3370, -81.6613),
			"san francisco": (37.7751, -122.4193),
			"columbus": (39.9848, -82.9850),
			"indianapolis": (39.7767, -86.1459),
			"fort worth": (32.7795, -97.3463),
			"charlotte": (35.2087, -80.8307),
			"seattle": (47.6205, -122.3509),
			"denver": (39.7618, -104.8806),
			"el paso": (31.8484, -106.4270),
			"washington": (38.9041, -77.0171),
			"boston": (42.3320, -71.0202),
			"detroit": (42.3830, -83.1022),
			"nashville": (36.1718, -86.7850),
			"memphis": (35.1035, -89.9785),
			"portland, or": (45.5370, -122.6500),	# Special city key: to differentiate entry from Portland, ME
			"oklahoma city": (35.4671, -97.5137),
			"las vegas": (36.2277, -115.2640),
			"louisville": (38.1781, -85.6667),
			"baltimore": (39.3002, -76.6105)}

def find_city_key(coordinates, city_coordinate_dict):
	for key in city_coordinate_dict.keys():
		if (city_coordinate_dict[key][0] == coordinates[0]) and (city_coordinate_dict[key][1] == coordinates[1]):
			return key
	return None

# Dictionary of Sentiment Data by City
# * City name as key
# * For each weather type, counts of negative-sentiment tweets and all tweets to be tallied
# * Each city to contain region class
# Same keys as dict_city_coordinate; -1 marks a region class that has not been assigned yet
dict_city_wthr_sntmt = {city_name: [np.zeros((kBest_weather, 2), dtype=float), -1] for city_name in dict_city_coordinate}

#array_coordinates_scaled = scaler_coordinates.fit_transform(array_coordinates)
#kMeans_coordinate_all_labels = kMeans_model(array_coordinates_scaled, kBest_coordinates)[1]

# Updating Region Labels for City-Weather-Sentiment Dictionary
for i_coord in range(len(array_coordinates_unique)):
	key = find_city_key(array_coordinates_unique[i_coord], dict_city_coordinate)
	if key is not None:
		dict_city_wthr_sntmt[key][1] = array_kMeans_coordinate_labels[i_coord]	
				
kMeans_weather_labels = kMeans_weather[1]
#print "# Diagnostic: kMeans_weather_labels = "
#print kMeans_weather_labels

for i_data in range(dataframe_full.values.shape[0]):
	city_coordinates = array_coordinates[i_data]
	city_key = find_city_key(city_coordinates, dict_city_coordinate)
#	print "# Diagnostic: city_key = "+str(city_key)
	weather_label = kMeans_weather_labels[i_data]
#	print "# Diagnostic: weather_label = "+str(weather_label)
	if city_key is not None:
		dict_city_wthr_sntmt[city_key][0][weather_label][1] += 1	# Incrementing total tweet count for city + weather-type combination
		if (array_AFINN[i_data] < 0) or (array_comp[i_data] < 0):
			dict_city_wthr_sntmt[city_key][0][weather_label][0] += 1	# Incrementing negative-sentiment tweet count for city + weather-type combination

list_plot_markers = ['o', 'D', 's']

data_threshold = 10
list_neg_prop = []
for city_key in dict_city_wthr_sntmt.keys():
	for i_weather in range(kBest_weather):
		if (dict_city_wthr_sntmt[city_key][0][i_weather][1] > data_threshold):		
			list_neg_prop.append(dict_city_wthr_sntmt[city_key][0][i_weather][0] / dict_city_wthr_sntmt[city_key][0][i_weather][1])	
min_neg_prop = min(list_neg_prop)
max_neg_prop = max(list_neg_prop)

for i_weather in range(kBest_weather):
	bg_map = create_background_map()
	for i_region in range(kBest_coordinates):
		for city_key in dict_city_wthr_sntmt.keys():
			if (dict_city_wthr_sntmt[city_key][1] == i_region) and (dict_city_wthr_sntmt[city_key][0][i_weather][1] > data_threshold):
				neg_tweet_proportion = dict_city_wthr_sntmt[city_key][0][i_weather][0] / dict_city_wthr_sntmt[city_key][0][i_weather][1]
				x, y = bg_map(dict_city_coordinate[city_key][1], dict_city_coordinate[city_key][0])
				sct_plt0 = bg_map.scatter(x, y, c=neg_tweet_proportion, edgecolors='r', s=200.0+dict_city_wthr_sntmt[city_key][0][i_weather][1], marker=list_plot_markers[i_region], vmin=min_neg_prop, vmax=max_neg_prop, cmap=cm.viridis_r, alpha=1.0)
				sct_plt1 = bg_map.scatter(x, y, color='r', s=20.0, marker='o', alpha=1.0)
	plt.colorbar(sct_plt0)
	plt.savefig(getcwd()+"/data/WeatherType"+str(i_weather)+"_Sentiments.svg")
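
The tallying step in the cell above can be summarised in a few lines. This is a minimal sketch using only the standard library; the function `negative_proportions`, its record format, and the toy records are hypothetical and stand in for the notebook's arrays. It applies the same rule as the cell above: a tweet counts as negative if either its AFINN score or its NLTK compound score is below zero, and a city and weather combination is kept only if it has more than `min_count` tweets.

```python
from collections import defaultdict

def negative_proportions(records, min_count=10):
    """records: iterable of (city, weather_label, afinn_score, compound_score).
    Returns {(city, weather_label): proportion of negative tweets} for every
    combination with more than `min_count` tweets."""
    tallies = defaultdict(lambda: [0, 0])  # [negative count, total count]
    for city, weather_label, afinn_score, compound_score in records:
        tally = tallies[(city, weather_label)]
        tally[1] += 1
        if afinn_score < 0 or compound_score < 0:
            tally[0] += 1
    return {key: float(neg) / total
            for key, (neg, total) in tallies.items()
            if total > min_count}

# Toy records: 12 Houston tweets (half negative) and 3 Seattle tweets (below the threshold).
toy_records = ([("houston", 0, -1, 0.2)] * 6 +
               [("houston", 0, 2, 0.5)] * 6 +
               [("seattle", 2, 1, 0.1)] * 3)
print(negative_proportions(toy_records, min_count=10))
```

Filtering on the sample size before computing the proportion keeps cities with only a handful of tweets from dominating the colour scale.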


#### The West Coast is Happier, the East Coast is Happier by Day, and the South is Unhappy

We start off by looking at Type 0 weather:

##### Type 0 Weather -- The Hot and Dry South

![title](data/WeatherType0_Sentiments.svg)

We notice that the hot, dry weather characteristic of Type 0 is seen mostly in cities of the South region (diamond-shaped markers on the plot above). 

##### Type 1 Weather -- Cool, Humid, and Clear Nights of the East Coast and Mid-West

Type 1 weather (cool, humid, mostly clear) is found in the East Coast and Mid-West region (circular markers):

![title](data/WeatherType1_Sentiments.svg)

##### Type 2 Weather -- Cool and Misty West Coast Ambience

Type 2 weather (cool, misty/hazy) is found in the West Coast and Rocky Mountain regions (square markers):

![title](data/WeatherType2_Sentiments.svg)

It is interesting how the fabled fog in cities like San Francisco has made it to our dataset. =)

##### Type 3 Weather -- Temperate, Humid, and Clear Days of the East Coast and Mid-West

Lastly, Type 3 (temperate, slightly humid, mostly clear) weather is also characteristic of the East Coast and Mid-West. In fact, it stands to reason that Type 3 is daytime weather for this region, whereas Type 1 is night-time weather.

![title](data/WeatherType3_Sentiments.svg)

##### Notes on the Colour Scheme and an Overall Picture of "Happiness" in America

We note that a colour of yellow is associated with a happier sentiment, whereas a colour of blue with a sadder one (playing on the old stereotype of sunny = happy and blue = sad). 

Based on the pictures above, the South seems to be in the most negative state of mind, whereas the West Coast seems to be the happiest. The latter seems to be an affirmation of a social stereotype of places like California being happier than the rest of the USA. 

The East Coast and Mid-West seem to be happier during the day (Type 3 weather) than at night (Type 1 weather), seemingly corroborating what was observed in the circadian rhythm analysis.

### Topic Models of the Sentiments -- Positive, Negative, and Neutral

We look at using Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorisation (NMF) topic models of positive, negative, and neutral tweets. We do not analyse location-specific details here, but merge the positive, negative, and neutral tweets from various locations into single data files.

For brevity, we look only at the NMF models here.
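
To make the idea of an NMF topic model concrete, here is a minimal from-scratch sketch. It is not the `nmf_model` helper used in the cell below: the functions `bag_of_words`, `nmf_multiplicative`, and `top_words`, the toy tweets, and the plain word-count features are all assumptions made for illustration. The factorisation uses the Lee-Seung multiplicative updates for the Frobenius-norm objective, which keep both factors non-negative.

```python
import numpy as np

def bag_of_words(documents):
    """Build a document-term count matrix and its vocabulary from whitespace-split text."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = np.zeros((len(documents), len(vocabulary)))
    for row, doc in enumerate(documents):
        for word in doc.lower().split():
            counts[row, index[word]] += 1
    return counts, vocabulary

def nmf_multiplicative(V, num_topics, num_iterations=200, seed=0, eps=1e-9):
    """Approximate V (documents x terms) by W.dot(H) with W >= 0 and H >= 0.
    Rows of H are topics (weights over terms); rows of W are topic weights per document."""
    rng = np.random.RandomState(seed)
    W = rng.rand(V.shape[0], num_topics)
    H = rng.rand(num_topics, V.shape[1])
    for _ in range(num_iterations):
        # eps guards against division by zero in the element-wise ratios
        H *= W.T.dot(V) / (W.T.dot(W).dot(H) + eps)
        W *= V.dot(H.T) / (W.dot(H).dot(H.T) + eps)
    return W, H

def top_words(H, vocabulary, num_top_words=3):
    """List the highest-weighted words of each topic."""
    return [[vocabulary[i] for i in np.argsort(-row)[:num_top_words]] for row in H]

toy_tweets = ["storm flood rain", "flood storm warning",
              "album song tour", "song album release"]
V, vocab = bag_of_words(toy_tweets)
W, H = nmf_multiplicative(V, num_topics=2)
print(top_words(H, vocab))
```

Each topic is read off as the words carrying the largest weights in one row of `H`, which is the same kind of "top words" listing printed for the real tweets below.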


In [30]:
'''
=====================================================
	Analysis of Textual Data from Twitter
=====================================================
'''

def load_text_data_to_list(data_file_name):
	data_list = []
	if path.exists(getcwd()+'/'+data_file_name):
		data_file = open(getcwd()+'/'+data_file_name, 'r')
		for line in data_file:
			data_list.append(line)
		data_file.close()
	else:
		print "# Warning [load_text_data_to_list(...)]: File \'"+data_file_name+"\' not found in current directory ("+getcwd()+')'
	return data_list

list_neg = load_text_data_to_list("neg.txt")
list_pos = load_text_data_to_list("pos.txt")
list_neu = load_text_data_to_list("neu.txt")

def process_nmf(data_list, num_features=1000, num_topics=20, num_top_words=10):
	nmf_model_instance, nmf_feature_names = nmf_model(data_list, num_features, num_topics)
	print_top_words(nmf_model_instance, nmf_feature_names, num_top_words)

def process_lda(data_list, num_features=1000, num_topics=20, num_top_words=20):
	lda_model_instance, lda_feature_names = lda_model(data_list, num_features, num_topics)
	print_top_words(lda_model_instance, lda_feature_names, num_top_words)
	
print "# Negative Topics (NMF)..."
process_nmf(list_neg)
print "# Positive Topics (NMF)..."
process_nmf(list_pos)
print "# Neutral Topics (NMF)..."
process_nmf(list_neu)

# Negative Topics (NMF)...
Topic #0:
https new dead real bad miss lose video sorry fake
Topic #1:
rt say hell life day new fake niggas miss got
Topic #2:
fuck 12 zwy5p3pm2x chikin eat juice2wavy yes like wh0rex flying
Topic #3:
shit finna ain holy stay tired doing two__xs ivs3nov2gt clearly
Topic #4:
people think wrong say really buy right making smart black
Topic #5:
bitch real 4w7wdmlwsf leave tryna lil surprise juanii___g giooplbhf2 bout
Topic #6:
just want did leave wanna tryna remember 4w7wdmlwsf said picture
Topic #7:
trump donald twitter mr ban president blocking clowns stephenking blocked
Topic #8:
amp man realdonaldtrump burning getting played thing game jersey ridiculous
Topic #9:
ass yo like finna dead tired named met girl fat
Topic #10:
im crying anime opening spongebob q8ew4evuui marynicolexx going rt ugly
Topic #11:
don like want niggas let really bother come hoes understand
Topic #12:
stop need watching pics wyd multiple playing didn girl drop
Topic #13:
fucking kfvilj1a

#### A Pop Star, A President, and a Hurricane -- Topic Models of a Nation's Tweets

Negative tweets seem to consist of popular culture items -- including what seems to be pornographic references =S -- and some political rants. Popular culture and politics also seem to be key contributors to positive-sentiment tweets. The notable difference in the neutral-sentiment tweets has to do with the presence of the topic of Hurricane Harvey.

In fact, Hurricane Harvey could be a factor behind the South feeling more negative than the rest of the country.

### A Flask App for Presenting Topic Models by City

We realise there is simply too much data to be presented for topic models of tweets. To make matters more tractable, we opt to create an HTML file for every city using a small Python script, such that we can use the resultant files to create a small web app in Flask. 

We incorporate the following items into each HTML file:
    1. Word clouds for the positive, negative, and neutral tweets for each city
    2. NMF and LDA topic model results    

In [31]:
'''
============================================================
	HTML Pages for Topic Model Data Presentation	
============================================================
'''

from AncilliaeHTML import *

def process_nmf_html(html_file, h2_text, data_list, num_features=1000, num_topics=20, num_top_words=10):
	nmf_model_instance, nmf_feature_names = nmf_model(data_list, num_features, num_topics)
	add_html_h2(html_file, h2_text)
	top_words_html(nmf_model_instance, nmf_feature_names, num_top_words, html_file)	

def process_lda_html(html_file, h2_text, data_list, num_features=1000, num_topics=20, num_top_words=10):
	lda_model_instance, lda_feature_names = lda_model(data_list, num_features, num_topics)
	add_html_h2(html_file, h2_text)
	top_words_html(lda_model_instance, lda_feature_names, num_top_words, html_file)	

def get_text(data_file_name):
	file_path = getcwd()+'/'+data_file_name
	if path.exists(file_path):
		with open(file_path) as data_file:	# Context manager closes the file handle after reading
			return data_file.read()
	return None

def create_html_dir(dir_structure):
	if dir_structure[0] == '/':
		dir_structure = dir_structure[1:]
	if dir_structure[-1] == '/':
		dir_structure = dir_structure[:-1]
	dir_names = dir_structure.split('/')
	current_dir = getcwd()
	for dir_name in dir_names:
		current_dir += '/' + dir_name
		if not(path.exists(current_dir)):
			mkdir(current_dir)

flask_app_main_dir = "/FlaskApp/"
flask_app_pages_subdir = "pages/"
flask_app_images_subdir = "images/"
flask_app_pages_dir = flask_app_main_dir + flask_app_pages_subdir
flask_app_images_dir = flask_app_main_dir + flask_app_pages_subdir + flask_app_images_subdir

create_html_dir(flask_app_pages_dir)
create_html_dir(flask_app_images_dir)

index_html_file_name = "TopicIndex.html"
index_html_file = init_html_file(getcwd()+flask_app_main_dir+index_html_file_name, "Index Page -- Tweet Sentiment Topics by City")	

for city in dict_city_coordinate.keys():
	if city == "san Jose":
		continue
	name_prefix = city.replace(' ', '_').replace(',', '')
	
#	flask_app_pages_dir = flask_app_pages_dir+name_prefix
		
	
	list_neg = load_text_data_to_list(name_prefix+"_full__neg.txt")
	list_pos = load_text_data_to_list(name_prefix+"_full__pos.txt")
	list_neu = load_text_data_to_list(name_prefix+"_full__neu.txt")

	text_neg = get_text(name_prefix+"_full__neg.txt")
	text_pos = get_text(name_prefix+"_full__pos.txt")
	text_neu = get_text(name_prefix+"_full__neu.txt") 

	city_html_file = init_html_file(getcwd()+flask_app_pages_dir+'/'+name_prefix, city)

	create_wordcloud_image(text_neg, getcwd()+flask_app_images_dir+'/'+"WordCloud__neg__"+name_prefix+".png")
	add_html_h2(city_html_file, "Word Cloud -- Negative Tweets")
	add_html_image(flask_app_images_subdir+"WordCloud__neg__"+name_prefix, city_html_file)	#add_html_image(flask_app_images_subdir+"WordCloud__neg__"+name_prefix+".png", city_html_file)
	create_wordcloud_image(text_pos, getcwd()+flask_app_images_dir+'/'+"WordCloud__pos__"+name_prefix+".png")
	add_html_h2(city_html_file, "Word Cloud -- Positive Tweets")
	add_html_image(flask_app_images_subdir+"WordCloud__pos__"+name_prefix, city_html_file)	# add_html_image(flask_app_images_subdir+"WordCloud__pos__"+name_prefix+".png", city_html_file)
	create_wordcloud_image(text_neu, getcwd()+flask_app_images_dir+'/'+"WordCloud__neu__"+name_prefix+".png")
	add_html_h2(city_html_file, "Word Cloud -- Neutral Tweets")
	add_html_image(flask_app_images_subdir+"WordCloud__neu__"+name_prefix, city_html_file)	# add_html_image(flask_app_images_subdir+"WordCloud__neu__"+name_prefix+".png", city_html_file)

	process_nmf_html(city_html_file, "NMF Model -- Negative Tweets", list_neg)
	process_nmf_html(city_html_file, "NMF Model -- Positive Tweets", list_pos)
	process_nmf_html(city_html_file, "NMF Model -- Neutral Tweets", list_neu)

	process_lda_html(city_html_file, "LDA Model -- Negative Tweets", list_neg)
	process_lda_html(city_html_file, "LDA Model -- Positive Tweets", list_pos)
	process_lda_html(city_html_file, "LDA Model -- Neutral Tweets", list_neu)
	
	end_html_file(city_html_file)
	add_html_link(index_html_file, flask_app_pages_subdir+name_prefix, city)

end_html_file(index_html_file)

#### Sample Word Cloud Image -- Neutral-Sentiment Tweets for Houston

We present a sample word cloud for Houston, TX, where a visual representation is provided for the words used in neutral-sentiment tweets recorded for that city.

![title](FlaskApp/pages/images/WordCloud__neu__houston.png)

As may be expected, Hurricane Harvey seems to be taking centre-stage.


## Discussion -- Categories of Weather in USA, Impacts of Circadian Rhythm on Tweet Sentiment, Regional Variations of Positivity/Negativity, and Hot Topics

Here, we attempt to answer the questions we had set out to investigate:

    1. We find that weather in the United States could be categorised, using a k-Means Clustering Method, into four different classes for August 24th & 25th, 2017:
        * We further find that these correspond roughly to three different geographical regions, which themselves can be obtained via k-Means Clusters.
        * There seem to be five principal components of weather data based on standard Principal Component Analysis, which we interpret to form three distinct classes:
            1. Perceived temperature, including effects of humidity and pressure
            2. Evaporative effects of sunlight and their impact on moisture in the atmosphere
            3. Three different contributors to the dynamics of cloud formation.
    2. We observe that once tweet sentiments are analysed in the context of a "journal phase" parameter (defined and explained above), the data seems to offer interesting insights into positive/negative-sentiment scores observed.
        * Tweets seem to have the lowest sentiment scores in the middle of the night. There could be a correlation between darkness and depression: depressed people may be using social media later into the night, and their depression is most significant at the height of darkness. 
        * Sunset is associated with a plunge in tweet sentiment positivity.
        * Sunrise seems to correspond to a peak in tweet sentiment positivity.
    3. We also undertake analysis of the geographical distribution of negative-sentiment tweets:
        * People on the West Coast seem to be marginally happier.
        * People on the East Coast are happier during the day.
        * People in the South are not as happy, perhaps, due to the effect of Hurricane Harvey (as noted from textual analysis).
    4. Upon applying LDA and NMF topic models to the tweet data, we notice three topics that immediately stand out:
        * President Donald Trump, who has a well-known Twitter presence
        * Taylor Swift, the pop star who had announced a new album a day earlier
        * Hurricane Harvey, which was pounding the coast of Texas during that time

## Conclusion -- Towards Using Data Science as a Tool of Modern Psychology and Psychiatry

The challenges posed by mental health issues have been part of the story of humankind for a very long time. The progress of science has helped shape the landscape of psychology and psychiatry throughout human history. From the witch doctors of prehistory, to the ancient Greek and Roman philosophers, to modern-day figures like Sigmund Freud and Alice Miller, psychology has taken a convoluted course from pseudo-science to science. In these exciting times when Data Science is helping quantify the often-qualitative nuances of the Social Sciences, it is hoped that psychology and psychiatry will benefit from psychometric analysis, whereby the diagnostic and interventional challenges posed by modern-day mental health issues can be overcome.