# Arlington 2050 Summary

## Introduction -
The Arlington 2050 project is a year-long initiative that seeks community feedback on what Arlington should look like in 2050 and the challenges we must address to get there. Rachael and I worked specifically with the 'Postcard text tracker' dataset, which was collected at many locations, ranging from local libraries to community events. All of our data came from postcards filled out at those locations, either handed in in person or dropped in a box.

## Ingesting and Cleaning Data -
The dataset contained hundreds of entries from multiple locations across the county, which meant a little extra work when cleaning the data. We wanted each location to have its own dataframe, since this makes it easier to look at the specifics of a given location, so we created a new dataframe for each one.

To start, let's import pandas.

In [None]:
import pandas as pd

Then we need to read the file.

In [None]:
ds = pd.read_excel("../datasets/Postcard_text_tracker.xlsx")
ds 

Within this dataset, some columns are inaccurately named. To make the data more readable, we have to rename them.

In [None]:
df = ds.rename(columns={"Unnamed: 2" : "Multiple Entries", "Unnamed: 3" : "Collection Points"})
df

The last thing we have to do to clean the data is separate the entries by each location. This entails creating a new dataframe for each location with their specific entries.

In [None]:
kickoff = df[df["Collection Points"] == "Kickoff"]

column_text = kickoff.loc[:, 'Text']

The code above creates a new dataframe called kickoff, which contains only the entries collected at the Kickoff location, and then pulls out its Text column. We did this for every location.
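Since the same filter is repeated for every location, one way to avoid writing it out by hand is to group by the location column. The sketch below runs on made-up rows (the sample entries and the names `sample` and `location_frames` are hypothetical, not from the real dataset) and assumes the column names used above.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real postcard data.
sample = pd.DataFrame({
    "Text": ["More trees", "Affordable housing", "Safe bike lanes"],
    "Collection Points": ["Kickoff", "Central Library", "Kickoff"],
})

# Build one dataframe per collection point, keyed by location name.
location_frames = {
    location: group.reset_index(drop=True)
    for location, group in sample.groupby("Collection Points")
}

# Look up a single location by name.
kickoff_sample = location_frames["Kickoff"]
print(kickoff_sample["Text"].tolist())
```

A dictionary keeps every location's dataframe reachable by name, so later steps can loop over all locations instead of repeating the same cell.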

After processing and cleaning the data, we began the analysis, which required importing several libraries.

In [None]:
import spacy
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

Our first mini project was creating word clouds with the data. To start, for each location's dataframe, we turned the text column into a string.

In [None]:
column_text

text_string = column_text.to_string()

text_string
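One thing to keep in mind is that `to_string()` also writes the row index into the string, so index numbers can end up among the words. An alternative, sketched here on a made-up Series (the sample values and variable names are hypothetical), is to join the entries directly:

```python
import pandas as pd

# Hypothetical stand-in for a location's 'Text' column, including a blank entry.
sample_text = pd.Series(["More trees", None, "Safe streets"], index=[4, 9, 12])

# Drop missing entries, make sure everything is a string, and join with spaces.
joined_text = " ".join(sample_text.dropna().astype(str))

print(joined_text)
```

The joined string contains only the postcard text, with no index numbers or padding.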

Once all the text has been turned into a string, we can use it to find the commonly used words.

In [None]:
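text = text_string

# Load the small English model (nlp must be defined before it is used)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# If we want to lemmatize instead:
# words = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

# Keep every token that is not a stop word or punctuation
words = [token.text for token in doc if not token.is_stop and not token.is_punct]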
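To check which words actually dominate, a quick frequency count can help. This is a minimal sketch using Python's built-in `Counter`; the `sample_words` list is a hypothetical stand-in for the filtered words, not real postcard data.

```python
from collections import Counter

# Hypothetical tokens standing in for the filtered word list.
sample_words = ["Trees", "housing", "trees", "community", "Housing", "trees"]

# Lowercase first so "Trees" and "trees" are counted together.
word_counts = Counter(word.lower() for word in sample_words)

# The three most frequent words with their counts.
top_words = word_counts.most_common(3)
print(top_words)
```

Comparing the top few words per location gives a numeric view of the same differences the word clouds show visually.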

Once we have all of our common words, we can visualize them as a word cloud!

In [None]:
wordcloud = WordCloud(
    width=800,
    height=400,
    background_color='white',
    max_words=100,
    mask=None,
    contour_width=3,
    contour_color='steelblue',
).generate(" ".join(words))

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

You can customize the word cloud above through its width and height, as well as some of its colors.

One thing we noticed in our word clouds was that every location seemed to have different common words. Some locations were similar, but no two were exactly the same.

Some of the most common words ranged from "Arlington", "Community", "Housing" to "Trees", "Green", and "Safe". 

All of the word clouds had similarities, but their differences stood out more. It was super interesting how one part of Arlington focused on community while another focused on trees and green!

Next, we're going to do sentiment analysis. To start, we have to install some libraries.

In [None]:
%%bash
pip3 install nbformat
pip3 install spacytextblob
pip3 install spacy
pip3 install textblob
python3 -m spacy download en_core_web_sm
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0.tar.gz

In [None]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

#Example 1

text = text_string
doc = nlp(text)


doc._.blob.polarity

The code above tests the polarity of the text from the entire kickoff dataframe (mentioned earlier).

In [None]:

doc._.blob.subjectivity

And this code above tests the subjectivity of the same dataframe.

In the end, we can see that although you might assume most entries would be positive, not all are. Kickoff has a polarity of 0.2850 on a scale from -1 (negative) to 1 (positive), which shows that it is fairly positive with some negativity.
The subjectivity score, which ranges from 0 (objective) to 1 (subjective), shows that the entries are partially based on opinion.
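To make the two scales easier to read, the scores can be turned into short labels. The helper below is only an illustrative sketch: the function name `describe_sentiment` and the 0.1 and 0.5 cut-offs are assumptions chosen for this example, not part of TextBlob or spacytextblob.

```python
def describe_sentiment(polarity, subjectivity):
    """Turn TextBlob-style scores into a short description.

    The 0.1 and 0.5 cut-offs are arbitrary assumptions for illustration.
    """
    if polarity > 0.1:
        tone = "positive"
    elif polarity < -0.1:
        tone = "negative"
    else:
        tone = "neutral"
    leaning = "subjective" if subjectivity >= 0.5 else "objective"
    return f"{tone}, mostly {leaning}"

# 0.2850 is the kickoff polarity reported above;
# the 0.5 subjectivity is a made-up placeholder value.
print(describe_sentiment(0.2850, 0.5))
```

Running the same helper on each location's scores would make it easy to compare locations side by side.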

The last thing we did was run a semantic search over the text.

In [None]:
%%capture
%run ./spell_check.ipynb

The code above runs a prewritten spell checker by Mr. Jones on the whole dataframe.

In [None]:
# Requires the large English model: python3 -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")
query = "missing middle"
query_doc = nlp(query)  # parse the query once instead of once per row

def similarityToQuery(text):
    return nlp(text).similarity(query_doc)

# df_events is assumed to be defined by the spell_check notebook run above
df_events['similarity_to_query'] = df_events['spell_checked_text'].apply(similarityToQuery)
pd.set_option('display.max_colwidth', None)

# Sort once, then print the three most similar comments
top_matches = df_events.sort_values('similarity_to_query', ascending=False)
for i in range(3):
    print(top_matches.iloc[i]["concatenated_text"])

In the code above, I did a semantic search within my dataset. This specific code returns the 3 comments most similar to the query "missing middle", whether they mention it directly or are just related to it!
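By default, spaCy's similarity score is the cosine similarity between the averaged word vectors of the two texts. To show what that ranking does, here is a small sketch in plain Python; the 3-dimensional vectors and comment names are made up for illustration (real spaCy vectors have hundreds of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid dividing by zero for empty vectors
    return dot / (norm_a * norm_b)

# Hypothetical vectors standing in for the query and three comments.
query_vector = [1.0, 0.0, 1.0]
comments = {
    "comment A": [2.0, 0.0, 2.0],
    "comment B": [0.0, 1.0, 0.0],
    "comment C": [1.0, 1.0, 0.0],
}

# Rank comments from most to least similar to the query.
ranked = sorted(
    comments,
    key=lambda name: cosine_similarity(comments[name], query_vector),
    reverse=True,
)
print(ranked)
```

Because the score depends only on the direction of the vectors, a comment can rank highly without containing the exact words of the query.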

## Summary - 
Throughout this project I learned about a wide range of topics. This project gave me the chance to solidify my Python skills and to use them in real-world applications. I thought this project was very interesting. It was cool to see the similarities between people's thoughts and how positive or negative those thoughts were. I think working with survey data is such an interesting topic and should be taught more often, especially as it gives students a chance to use their skills on real-world problems like we did. In the end, this was a very fun project, 10/10 would recommend.