# Arlington 2050 by Adam Hellinga

The Arlington 2050 project was run by Arlington County. Residents could send in postcards from the future (the year 2050) describing the improvements made to Arlington by then. For this project, we were tasked with analyzing that data both numerically and visually.

All of this code was written by me, by our teacher Mr. Jones, or by fellow students who each researched one of these topics in depth to give an expert view on it.

First, to get this all running, I need to import some modules.

In [None]:
import pandas as pd
import spacy
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from spellchecker import SpellChecker
from spacytextblob.spacytextblob import SpacyTextBlob
import seaborn as sns
import numpy as np

## Importing and Cleaning the Data

The data was split into a few Excel files based on the collection point. I analyzed the web input postcards, which were submitted electronically. Each respondent could send up to three postcards and could also write about how Arlington achieved these goals, although that section was frequently misinterpreted.

Here I input the data set I will be working with.

In [None]:
df = pd.read_excel("Public_Input_Postcards.xlsx")
df

The column names were confusing, so here I rename the columns to make them easier to understand and easier to write out in the code.

In [None]:
df.columns = ['id', 'zip', 'source', 'Card1', 'first_gettinghere','Card2','Card3','zip_selfreported','zip_selfreported2']

To build the word clouds, vectors, and other parts of this project, all the postcards (one to three per respondent) needed to be condensed into one column. Here I go through each row, check which postcard columns have a value, and add those values to a new Cards column that holds all the postcards in one.

In [None]:
for i in range(len(df.Card1)):
    if pd.notna(df.loc[i]["Card2"]):
        if pd.notna(df.loc[i]["Card3"]):
            df.loc[i, "Cards"] = str(df.loc[i, "Card1"]) + " " + str(df.loc[i, "Card2"]) + " " + str(df.loc[i, "Card3"])
        else:
            df.loc[i, "Cards"] = str(df.loc[i, "Card1"]) + " " + str(df.loc[i, "Card2"])
    else:
        df.loc[i, "Cards"] = str(df.loc[i, "Card1"])
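
The same combination can also be written without indexing row by row. The following is a minimal sketch, assuming the same Card1, Card2, and Card3 column names; the helper name `combine_cards` and the sample rows are made up for illustration. Unlike the loop, it keeps Card3 even when Card2 is empty, and it skips an empty Card1 instead of writing the text "nan".

```python
import pandas as pd

def combine_cards(frame, columns=("Card1", "Card2", "Card3")):
    """Join the non-empty postcard columns of each row with single spaces."""
    return frame[list(columns)].apply(
        lambda row: " ".join(str(value) for value in row if pd.notna(value)),
        axis=1,
    )

# Hypothetical sample rows, not taken from the real data set
sample = pd.DataFrame({
    "Card1": ["More parks", "Better schools", "Safe bike lanes"],
    "Card2": ["Faster buses", None, None],
    "Card3": ["Cheaper housing", None, "More trees"],
})
sample["Cards"] = combine_cards(sample)
print(sample["Cards"].tolist())
```

On the real table this would be called as `df["Cards"] = combine_cards(df)`.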

We needed all the postcards in one column, but we also wanted all of the text in one column, so here the two are combined. I used the Cards column as the starting point because it already combines three of the four text columns.

In [None]:
for i in range(len(df.Card1)):
    if pd.notna(df.loc[i]["first_gettinghere"]):
        df.loc[i, "All_text"] = str(df.loc[i, "Cards"]) + " " + str(df.loc[i, "first_gettinghere"])
    else:
        df.loc[i, "All_text"] = str(df.loc[i, "Cards"])
    
df

## Putting it into a Word Cloud
This is done before the spell check, which also removes stop words, as it's possible that some important words may be lost in that process.

This word cloud shows that some of the important things people want to see in 2050 include schools, parks, housing, places to walk, and more.

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color='white', max_words=100, mask=None, contour_width=3, contour_color='steelblue').generate(" ".join(df.Cards))

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
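
A word cloud shows relative size but not exact counts. As a rough companion to it, the sketch below counts the most frequent words with Python's `collections.Counter`. The function name `top_words`, the small stop-word set, and the sample postcards are illustrative assumptions; on the real data one could pass `df.Cards` and the `STOPWORDS` set imported earlier.

```python
import re
from collections import Counter

def top_words(texts, stopwords=frozenset(), n=5):
    """Return the n most common lowercase words that are not stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", str(text).lower())
        counts.update(word for word in words if word not in stopwords)
    return counts.most_common(n)

# Hypothetical postcards, not taken from the real data set
cards = [
    "The parks are clean and the schools are great",
    "New schools and more parks near housing",
    "Affordable housing and safe parks",
]
ignored = {"the", "and", "are", "more", "near", "new"}
print(top_words(cards, stopwords=ignored, n=3))
```

Each item in the result is a (word, count) pair, so the list can be compared directly against the largest words in the cloud.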

## Sentiment Analysis
Here we analyze the polarity, or how positive (1) or negative (-1) a statement is. We also analyze the subjectivity, or how objective (0) or subjective (1) a statement is.

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

This creates two new columns, polarity and subjectivity, holding the respective value for each statement.

In [None]:
for i in range(len(df.Cards)):
    doc = nlp(df.loc[i]["Cards"])
    df.loc[i, "polarity"] = doc._.blob.polarity
    df.loc[i, "subjectivity"] = doc._.blob.subjectivity
df

Here we plot the polarity of each statement as a histogram, so we can see what the most common polarity is. This graph shows that most of these statements are slightly positive or neutral, and the rest are mostly positive.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x="polarity")
plt.title('Polarity of Arlington 2050 Public Input Post Cards (Web)')
plt.xlabel('polarity')
plt.ylabel('count')
plt.grid(False)
plt.show()
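
To back up a reading of the histogram with numbers, the polarity scores can be grouped into bands and counted. This is a sketch under assumptions: the band edges (-0.1 and 0.1) and the labels are arbitrary choices for illustration, the function name `polarity_bands` is made up, and the sample scores are invented. On the real data one would pass `df["polarity"]`.

```python
import pandas as pd

def polarity_bands(scores, neutral_width=0.1):
    """Count how many scores fall into negative, neutral, and positive bands."""
    bands = pd.cut(
        scores,
        bins=[-1.0, -neutral_width, neutral_width, 1.0],
        labels=["negative", "neutral", "positive"],
        include_lowest=True,  # so a score of exactly -1 is still counted
    )
    return bands.value_counts(sort=False)

# Hypothetical polarity scores, not taken from the real data set
sample_scores = pd.Series([-0.4, 0.0, 0.05, 0.2, 0.35, 0.8, 1.0])
band_counts = polarity_bands(sample_scores)
print(band_counts)
```

Widening or narrowing `neutral_width` changes how many statements count as neutral, so any conclusion drawn from these counts depends on that choice.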

Here we do the same thing, but we're graphing the subjectivity. Here we can see that most statements are a mix of objective and subjective language, or they are entirely objective.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x="subjectivity")
plt.title('Subjectivity of Arlington 2050 Public Input Post Cards (Web)')
plt.xlabel('subjectivity')
plt.ylabel('count')
plt.grid(False)
plt.show()

## Spell Checking

In [None]:
nlp = spacy.load('en_core_web_sm')

Here we add the words that the spell checker doesn't know by default, such as local place names, so that it recognizes them instead of trying to correct them.

In [None]:
spell = SpellChecker()

spell.word_frequency.load_words([
    'Arlington'
    , 'Glebe'
    , 'Ballston'
    , 'Rosslyn'
    , 'Pershing'
    , 'Rockville'
    , 'MD'
    , 'VA'
    , 'Maryland'
    , 'Virginia'
    , 'Bluemont'
    , 'Wilson'
    ])

This code goes through the text, finds which words are misspelled, and corrects them, while also removing punctuation and stop words.

In [None]:
def spell_check(text):
    doc = nlp(text)  # Process the text with spaCy
    corrected_words = []
    
    # Find misspelled words
    misspelled = spell.unknown([token.text for token in doc if not token.is_punct and not token.is_stop])

    for token in doc:
        if not token.is_punct and not token.is_stop and len(token.text.strip()) > 0:  # Exclude punctuation and stop words
            word = token.text.strip()
            if word.lower() in misspelled:
                correction = spell.correction(word)
                if (correction is not None) and (correction.lower() != word.lower()):
                    corrected_words.append(correction)
                    #Uncomment this line to review the list of words that are correcting
                    #print(f"Correcting {word} => {correction}")
                else:
                    corrected_words.append(word.lower())
            else:
                corrected_words.append(word)  # Preserve correct words

    if len(corrected_words)>0:
        return " ".join(corrected_words)
    else: 
        return ""
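
For intuition about what a spell checker of this kind does, here is a toy sketch of the edit-distance approach described in Peter Norvig's well-known spelling corrector write-up, which pyspellchecker is based on. It is not the library's implementation: it only looks one edit away, and the names `edits1` and `correct` and the word counts are assumptions made up for illustration.

```python
def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word, frequencies):
    """Return word if known, else the most frequent known word one edit away."""
    word = word.lower()
    if word in frequencies:
        return word
    candidates = [c for c in edits1(word) if c in frequencies]
    return max(candidates, key=frequencies.get) if candidates else word

# Hypothetical word counts; place names are added so they count as known
frequencies = {"school": 50, "park": 40, "housing": 30}
frequencies.update({"ballston": 1, "rosslyn": 1})

for word in ["schol", "Ballston", "xyzzy"]:
    print(word, "=>", correct(word, frequencies))
```

Adding the place names to the known words is what keeps them from being "corrected" into something else, which is the role `load_words` plays above.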

This puts the checked words into a new column called checked_text.

In [None]:
df['checked_text'] = df['All_text'].apply(spell_check)

In [None]:
df

## Vectors
Here we need to load the medium model instead of the small one, as we need the word vectors that are only included in the medium and large models.

In [None]:
nlp = spacy.load("en_core_web_md") 

Here we set the topic we want to search for, as well as how many of the most closely related postcards we want returned.

In [None]:
query = "school"
related_values = 3

This finds how similar the given cell is to the query.

In [None]:
query_doc = nlp(query)  # parse the query once instead of once per row

def similarityToQuery(text):
    return nlp(text).similarity(query_doc)
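
By default, spaCy's `Doc.similarity` is the cosine similarity between the two texts' vectors, where a text's vector is the average of its token vectors. The sketch below reproduces that calculation with NumPy on made-up three-dimensional vectors. The names `doc_vector` and `cosine_similarity` and the toy vectors are assumptions for illustration; the real `en_core_web_md` vectors have 300 dimensions.

```python
import numpy as np

def doc_vector(token_vectors):
    """Average the token vectors to get one vector for the whole text."""
    return np.mean(np.asarray(token_vectors, dtype=float), axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / norm) if norm else 0.0

# Made-up 3-dimensional word vectors, for illustration only
toy_vectors = {
    "school": [0.9, 0.1, 0.0],
    "teacher": [0.8, 0.3, 0.1],
    "parking": [0.0, 0.2, 0.9],
}
query_vec = doc_vector([toy_vectors["school"]])
same_topic = cosine_similarity(query_vec, doc_vector([toy_vectors["teacher"]]))
mixed_topic = cosine_similarity(
    query_vec, doc_vector([toy_vectors["teacher"], toy_vectors["parking"]])
)
print(same_topic, mixed_topic)
```

Because the vectors are averaged, a long postcard that mentions the query topic only once can score lower than a short one that is entirely about it.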

This goes through every row of checked_text and computes its similarity to the query.

In [None]:
df["similar_to_query"] = df["checked_text"].apply(similarityToQuery)

This sorts the table by similarity and prints the most closely related postcards, as many as requested.

In [None]:
# Sort once, then print the top matches
top_matches = df.sort_values('similar_to_query', ascending=False).head(related_values)
for text in top_matches["checked_text"]:
    print(text)


## Conclusion

I believe this project was a great way to truly see what brings this community together, and I think that visualizing the data was extremely helpful in identifying its themes. I learned how to create word clouds, how to properly sort through data to handle empty cells and correct misspelled words, and how polarity and subjectivity analysis work.