




Finally, a summary that reflects on this project, what you've learned from it, and what you thought of it. Feel free to discuss whatever you want in this section. You might want to discuss Pandas, spaCy, Arlington, surveys, and/or data science.

### Project 2050 Summary

## Introduction -
The Arlington 2050 project is a year-long initiative that seeks community feedback on what Arlington should look like in 2050 and the challenges we must address to get there. Rachael and I worked with the Postcard text tracker dataset. This dataset contains responses collected at many locations, ranging from local libraries to community events. All responses were written on postcards, either handed in in person or dropped in a box at those locations.

## Ingesting and Cleaning Data -
We had hundreds of entries from multiple locations across the county, which meant a little extra work when cleaning the data. We wanted each location to have its own dataframe so that it would be easier to look at specific locations, so we created a new dataframe for each one.

To start, let's import pandas.

In [None]:
import pandas as pd

Next, let's read the file.

In [None]:
ds = pd.read_excel("../datasets/Postcard_text_tracker.xlsx")
ds 

Within this dataset, some columns are inaccurately named. To make the data more readable, we rename them.

In [None]:
df = ds.rename(columns={"Unnamed: 2" : "Multiple Entries", "Unnamed: 3" : "Collection Points"})
df

The last thing we have to do to clean the data is separate the entries by location. This means creating a new dataframe for each location with its specific entries.

In [13]:
kickoff = df[df["Collection Points"] == "Kickoff"]

column_text = kickoff.loc[:, 'Text']

The code above creates a new dataframe called kickoff, which holds all rows from df whose collection point is Kickoff. Next, we isolated the Text column, which holds the postcard entries, in column_text so that we could work with just the text. We did this for every location.
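Repeating the filter by hand for every location works, but the same result can be sketched in one pass with `groupby`. This is a minimal sketch on an invented miniature table; the location name "Library" and the example entries are assumptions for illustration, not values from the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the postcard table (values invented for illustration)
df = pd.DataFrame({
    "Collection Points": ["Kickoff", "Library", "Kickoff", "Library"],
    "Text": ["More trees", "Safe streets", "Affordable housing", None],
})

# One Series of postcard text per collection point, keyed by location name.
# dropna() removes postcards with no text.
text_by_location = {
    location: group["Text"].dropna()
    for location, group in df.groupby("Collection Points")
}

print(text_by_location["Kickoff"].tolist())
```

With a dictionary like this, each location's text can be looked up by name instead of keeping a separate variable per location.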

After cleaning the data, we began our analysis, which required importing several libraries.

In [16]:
import spacy
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

Our first mini project was creating word clouds from the data. To start, for each location's dataframe, we turned the text column into a string.

In [14]:
column_text

text_string = column_text.to_string()

text_string

'1      Housing available at all income levels w/ \\now...\n2      because in the mid 2020s Arlington focused on ...\n3      Arlington\'s investment in next gen smart growt...\n4      More multicultural and diversity friendly envi...\n5      Increased regional cooperation. There are hard...\n6      There are less cars on the road and parked in ...\n7      We are the leading model when it comes to high...\n8      Transportation, affordable housing, integrated...\n9      Multigenerational government making decisions....\n10     All of the native trees you planted have grown...\n11                                                   NaN\n12     We have eliminated the disparities that exist,...\n13     I feel safe in my home and every home in my ne...\n14     Arlington has an amazing vibe where there is a...\n15     The wildlife and nature is thriving, as are it...\n16     Arlington invested in infrastructure and land ...\n17     Expanded walkable likable "green" neighborhood...\n18     We h
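The output above shows that `to_string()` keeps the row numbers and cuts long entries off with "...", so some words never reach the analysis. One possible alternative, sketched here on an invented stand-in for the Text column, is to join the full entries directly.

```python
import pandas as pd

# Hypothetical stand-in for the Text column of one location
column_text = pd.Series(["More trees and parks", None, "Affordable housing for all"])

# Drop missing entries, then join the full text of each postcard with spaces
full_text = " ".join(column_text.dropna().astype(str))

print(full_text)
```

The joined string contains only the postcard text, with no row numbers, no `NaN`, and no truncation.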

Once all the text is in one string, we can use it to find the most commonly used words.

In [None]:
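# Load the small English model (the same one used for sentiment analysis below)
nlp = spacy.load("en_core_web_sm")

text = text_string

doc = nlp(text)

# To lemmatize instead, use token.lemma_ in place of token.text:
# words = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

# Keep every token that is neither a stop word nor punctuation
words = [token.text for token in doc if not token.is_stop and not token.is_punct]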
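To see which words are actually the most common, the filtered list can be counted. This is a small sketch using Python's `collections.Counter`; the `words` list here is invented for illustration and stands in for the list built above.

```python
from collections import Counter

# Hypothetical stand-in for the filtered token list
words = ["Housing", "trees", "housing", "community", "Trees", "housing"]

# Lowercase first so "Housing" and "housing" are counted together
word_counts = Counter(word.lower() for word in words)

# The two most frequent words with their counts
top_two = word_counts.most_common(2)
print(top_two)
```

The counts give exact numbers to go with the relative word sizes a word cloud shows.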

Once we have our common words, we can visualize them as a word cloud!

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color='white', max_words=100, mask=None, contour_width=3, contour_color='steelblue').generate(" ".join(words))

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

You can customize the word cloud above through its width and height, as well as some of its colors.

One thing we noticed in our word clouds was that every location had a different set of common words. Some words overlapped between locations, but no two clouds were the same.

The most common words ranged from "Arlington", "Community", and "Housing" to "Trees", "Green", and "Safe".

All of the word clouds had similarities, but their differences stood out more. It was interesting that one part of Arlington focused on "community" while another focused on "trees" and "green"!
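The comparison between locations can also be made explicit with set operations. The sketch below uses invented top-word sets for two locations (the location name "Library" and the sets themselves are assumptions for illustration) to show which words are shared and which are unique.

```python
# Hypothetical top words for two locations (invented for illustration)
top_words = {
    "Kickoff": {"arlington", "community", "housing", "safe"},
    "Library": {"arlington", "trees", "green", "housing"},
}

# Words both locations have in common
shared = top_words["Kickoff"] & top_words["Library"]

# Words that appear in only one of the two locations
only_kickoff = top_words["Kickoff"] - top_words["Library"]
only_library = top_words["Library"] - top_words["Kickoff"]

print(sorted(shared), sorted(only_kickoff), sorted(only_library))
```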

Next, we're going to do sentiment analysis. To start, we have to install some libraries.

In [26]:
%%bash
pip3 install nbformat
pip3 install spacytextblob
pip3 install spacy
pip3 install textblob
# Install the small English spaCy model from its release archive
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0.tar.gz



In [30]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

#Example 1

text = text_string
doc = nlp(text)


doc._.blob.polarity

0.2850960298305432

The code above tests the polarity of all the text from the Kickoff dataframe (mentioned earlier).

In [29]:

doc._.blob.subjectivity

0.5091077015413298

And the code above tests the subjectivity of the same text.

In the end, we're able to see that while you'd assume most entries would be positive, not all are. Polarity ranges from -1 (negative) to 1 (positive), so Kickoff's polarity of 0.2851 tells us the entries are fairly positive overall, with some negativity mixed in.
Subjectivity tells us whether something is opinion or not, ranging from 0 (objective) to 1 (subjective). Kickoff's subjectivity of 0.5091 shows that the entries are partly based on opinion.
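To make the two ranges concrete, here is a small helper that turns a pair of scores into a plain-English description. The function name and the cut-offs (0.1 for polarity, 0.5 for subjectivity) are assumptions chosen for illustration; they are not part of TextBlob or spacytextblob.

```python
def describe_sentiment(polarity, subjectivity):
    """Describe TextBlob-style scores in words.

    polarity runs from -1 (negative) to 1 (positive);
    subjectivity runs from 0 (objective) to 1 (subjective).
    The cut-offs below are arbitrary assumptions for illustration.
    """
    if polarity > 0.1:
        tone = "positive"
    elif polarity < -0.1:
        tone = "negative"
    else:
        tone = "neutral"
    stance = "leans opinion" if subjectivity > 0.5 else "leans factual"
    return f"{tone}, {stance}"

# Scores reported for the Kickoff text above
kickoff_description = describe_sentiment(0.2851, 0.5091)
print(kickoff_description)
```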

The last thing we did was a semantic search of the text.

In [21]:
%%capture
%run ./spell_check.ipynb

The code above runs a spell checker prewritten by Mr. Jones on the whole dataframe.

In [22]:
nlp = spacy.load("en_core_web_lg")
query = "missing middle"
# Parse the query once instead of on every row
query_doc = nlp(query)

def similarityToQuery(text):
    # Similarity between the word vectors of the comment and the query
    return nlp(text).similarity(query_doc)

df_events['similarity_to_query'] = df_events['spell_checked_text'].apply(similarityToQuery)
pd.set_option('display.max_colwidth', None)
# Sort once, then print the three most similar comments
top_matches = df_events.sort_values('similarity_to_query', ascending=False)
for comment in top_matches["concatenated_text"].head(3):
    print(comment)



Great neighbors, lots of trees, and hopefully more families taking advantage of missing middle housing!
Missing middle housing. We need representation on the Arlington Board #JDSPAIN
There is a free walk a beagle center on every block with snuggle corners inside, everyone on Earth is at peace thanks to unicorns and lastly, kids and grown ups are equal yay! Plain old 2024 was boring so we built a time machine that runs on smiles to take us to paradise in the year 2050.


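In the code above, I did a semantic search within my dataset. This code prints the three comments whose text is most similar to the query "missing middle". The first two mention missing middle housing directly, while the third does not, which shows that similarity scores can also surface unrelated comments.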
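By default, spaCy's `similarity` compares the word vectors of the two texts using cosine similarity. The sketch below shows that calculation on tiny invented vectors (real spaCy vectors have hundreds of dimensions); the function and variable names are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the query and two comments
query_vec = [1.0, 2.0, 0.0]
close_vec = [2.0, 4.0, 0.0]   # points in the same direction as the query
far_vec = [0.0, 0.0, 3.0]     # points in an unrelated direction

close_score = cosine_similarity(query_vec, close_vec)
far_score = cosine_similarity(query_vec, far_vec)
print(close_score, far_score)
```

A score near 1 means the two vectors point the same way, and a score near 0 means they are unrelated. Because the score depends only on direction, a comment can score highly without containing the query words.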

## Summary - 
