## Hello there! The following code cleans up and visualizes the data set from the Hispanic Heritage Community Festival.

### For context, the data was collected by asking Arlington residents questions about Arlington and how or where they see Arlington in the year 2050. The responses were gathered in English, in Spanish, and as voice messages. To sort through the data, I first needed to clean up the Excel file so that I could create graphs and charts highlighting the results. For this I used Python and pandas to clean up the Excel file, and spacytextblob, seaborn, and matplotlib to analyze and visualize the data.

Imports

In [None]:
%%bash
python3 -m pip install --upgrade pandas seaborn openpyxl  # openpyxl lets pandas read .xlsx files
python3 -m pip install --upgrade matplotlib
python3 -m pip install --upgrade wordcloud
python3 -m pip install --upgrade numpy
python3 -m pip install --upgrade spacy
python3 -m pip install --upgrade spacytextblob
python3 -m pip install --upgrade textblob
python3 -m pip install --upgrade pyspellchecker
# Download the English model only after spaCy itself is installed
python3 -m spacy download en_core_web_sm

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import spacy 

In [None]:
from spellchecker import SpellChecker
from wordcloud import WordCloud, STOPWORDS
from spacytextblob.spacytextblob import SpacyTextBlob  
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('spacytextblob')
sns.set_theme()

In [None]:
df_hispanic = pd.read_excel("Hispanic Heritage data.xlsx")
df_hispanic  # display the raw data without reading the file a second time


The cells in this section are meant to clean up the headings for easier access and legibility.

In [None]:
df_hispanic.dtypes

In [None]:
df_hispanic = df_hispanic.drop([0, 1]).reset_index(drop=True)

In [None]:
df_hispanic.columns = ['id', 'first', 'first_translated', 'second', 'second_translated']
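
As a small illustration of the cleanup steps above, here is a sketch on a made-up table. The contents of `raw` are hypothetical, and it assumes the first two rows hold leftover non-response text from the export (the notebook does not say what those rows contain).

```python
import pandas as pd

# Hypothetical raw export: long question text as headers, two leftover non-response rows.
raw = pd.DataFrame({
    "Response ID": ["meta", "meta", 1, 2],
    "What do you love about Arlington?": ["Q1", "text", "Parks", "Schools"],
})

# Remove the rows labeled 0 and 1, then renumber the remaining rows from 0.
clean = raw.drop([0, 1]).reset_index(drop=True)

# Short column names are easier to type and read than full question text.
clean.columns = ["id", "first"]
print(clean)
```

Without `reset_index(drop=True)` the remaining rows would keep their old labels (2, 3, ...), which can be confusing when looking up rows by position later.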

The function below looks through the 4 columns that together store the data from one response (one row). Because some text needed to be translated and some did not, many cells were left blank. This is why the cell after it uses the function to make a new column that holds all of the text for easy access.

In [None]:
def concatenate_text(row):
    text = ""
    if pd.notna(row['first_translated']):
        text += row['first_translated']
    elif pd.notna(row['first']):
        text += row['first']
    
    if pd.notna(row['second_translated']):
        text += ". " + row['second_translated']
    elif pd.notna(row['second']):
        text += ". " + row['second'] 

    return text # This returns the text of the given row
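
The same "prefer the translation, otherwise use the original" idea can also be sketched with pandas' `combine_first`. This is an alternative sketch, not the notebook's method: the `toy` data and the helper `join_answers` are hypothetical, and unlike the function above it leaves out the ". " separator when the first answer is missing.

```python
import pandas as pd

# Hypothetical toy responses; the column names follow the notebook.
toy = pd.DataFrame({
    "first": ["Mas parques", "More buses", None],
    "first_translated": ["More parks", None, None],
    "second": [None, "Safer streets", "Good schools"],
    "second_translated": [None, None, None],
})

# combine_first keeps the translated value and falls back to the original when it is missing.
first = toy["first_translated"].combine_first(toy["first"])
second = toy["second_translated"].combine_first(toy["second"])

def join_answers(a, b):
    """Join whichever answers are present with '. ', skipping missing ones."""
    parts = [str(p) for p in (a, b) if pd.notna(p)]
    return ". ".join(parts)

toy["concatenated_text"] = [join_answers(a, b) for a, b in zip(first, second)]
print(toy["concatenated_text"].tolist())
```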

This code uses sentiment analysis to determine the polarity of the text, meaning how positive or negative the given word or words are. Polarity ranges from -1 (most negative) to 1 (most positive). It also creates new columns to store the combined text and its polarity.

In [None]:
df_hispanic['concatenated_text'] = df_hispanic.apply(concatenate_text, axis=1) # new column holding all the text for each response
df_hispanic['source'] = "Hispanic Heritage Fest"

def polarity(row):
    doc = nlp(row['concatenated_text']) # process the text of the given row with spaCy
    return doc._.blob.polarity # sentiment analysis score from spacytextblob

df_hispanic['polarity'] = df_hispanic.apply(polarity, axis=1) # new column to show the polarity
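
To make the idea of polarity more concrete, here is a toy lexicon-based scorer. It is only an illustration: the word scores in `TOY_LEXICON` are made up, and the simple averaging rule is a simplification, not spacytextblob's actual algorithm, which relies on TextBlob's much larger lexicon.

```python
# Hypothetical word scores between -1.0 (negative) and 1.0 (positive).
TOY_LEXICON = {"great": 0.8, "love": 0.5, "safe": 0.5, "bad": -0.7, "expensive": -0.5}

def toy_polarity(text):
    """Average the scores of the words found in the lexicon; 0.0 if none are found."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

for sentence in ["I love the great parks.", "Housing is expensive.", "More buses by 2050"]:
    print(f"{toy_polarity(sentence):+.2f}  {sentence}")
```

A sentence with no scored words comes out as 0.0, which is worth keeping in mind when reading the results: a score of zero can mean "neutral" or simply "no known sentiment words".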

In [None]:
df_hispanic 

This code takes text and turns it into a word cloud. A word cloud is a cluster of words shown as an image in which more frequently used words appear larger. This allows us to see common words, and given the context we can see if there is a common problem or something everyone agrees on.

In [None]:
long_string = " ".join(df_hispanic['concatenated_text'])
wordcloud = WordCloud(
    width=800, height=400, background_color='white', max_words=100,
    mask=None, contour_width=3, contour_color='steelblue',
).generate(long_string)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
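
The sizes in a word cloud come from word counts. The sketch below shows that counting step on its own with `collections.Counter`. The `responses` list and the tiny `stop_words` set are hypothetical stand-ins; the WordCloud library has its own `STOPWORDS` set and its own tokenization.

```python
from collections import Counter
import re

# Hypothetical responses standing in for df_hispanic['concatenated_text'].
responses = [
    "More parks in Arlington",
    "Arlington needs better schools",
    "Safer schools and more community events",
]
# A tiny, hand-picked stop word list (an assumption for this sketch).
stop_words = {"in", "and", "more", "needs"}

# Lowercase everything, pull out the words, and count those that are not stop words.
words = re.findall(r"[a-z']+", " ".join(responses).lower())
counts = Counter(w for w in words if w not in stop_words)
print(counts.most_common(3))
```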

The word cloud showed that the most common words were ones like "Arlington", "community", and "school". Although the word cloud shows which words were used the most, it does not show whether they were used in a positive or negative way. This is why the earlier sentiment analysis code is important: it gives context on how positive or negative the responses were.

This block of code uses seaborn to show the data in a histogram. The graph groups the polarity scores into bins and shows how many answers fall into each bin. This lets us see how many comments were negative and how many were positive.

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(data=df_hispanic, x='polarity')
plt.title('Sentiment Analysis')
plt.xlabel('polarity')
plt.ylabel('number of answers')
plt.grid(True)
plt.show()  # the parentheses are needed to actually call the function
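
Alongside the histogram, the positive and negative answers can also be counted directly. This is a minimal sketch in which the `polarities` list is a hypothetical stand-in for `df_hispanic['polarity']`, and treating a score of exactly 0.0 as "neutral" is an assumption.

```python
# Hypothetical polarity scores standing in for df_hispanic['polarity'].
polarities = [0.5, 0.0, -0.25, 0.8, 0.0, 0.1]

def label(score):
    """Label a score by its sign; treating exactly 0.0 as neutral is an assumption."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Tally how many scores fall under each label.
summary = {"positive": 0, "neutral": 0, "negative": 0}
for score in polarities:
    summary[label(score)] += 1
print(summary)
```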

This first block of code prepares the spell checker by adding words it does not know, like "Arlington" or "VA". These words are important to have given that the data set is about Arlington, Virginia.

In [None]:
spell = SpellChecker()
spell.word_frequency.load_words([
    'Arlington'
    , 'Glebe'
    , 'Ballston'
    , 'Rosslyn'
    , 'Pershing'
    , 'Rockville'
    , 'MD'
    , 'VA'
    , 'Maryland'
    , 'Virginia'
    , 'Bluemont'
    , 'Wilson'
    ])

This code runs pyspellchecker to correct the spelling of the text. The given text is split into words (tokenized) with spaCy, and each word is then checked against pyspellchecker's word list to find out whether it needs to be corrected. Punctuation and stop words are left out of the result.

In [None]:
def spell_check(text):
    doc = nlp(text)  # Process the text with spaCy
    corrected_words = []
    
    # Find misspelled words
    misspelled = spell.unknown([token.text for token in doc if not token.is_punct and not token.is_stop])

    for token in doc:
        if not token.is_punct and not token.is_stop and len(token.text.strip()) > 0:  # Exclude punctuation and stop words
            word = token.text.strip()
            if word.lower() in misspelled:
                correction = spell.correction(word)
                if (correction is not None) and (correction.lower() != word.lower()):
                    corrected_words.append(correction)
                    # Uncomment this line to review which words are being corrected
                    # print(f"Correcting {word} => {correction}")
                else:
                    corrected_words.append(word.lower())  # no better suggestion, keep the word
            else:
                corrected_words.append(word)  # Preserve correct words

    return " ".join(corrected_words)  # an empty list joins to ""

df_hispanic['spell_checked_text'] = df_hispanic['concatenated_text'].apply(spell_check)
df_hispanic
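
pyspellchecker describes itself as based on Peter Norvig's spelling corrector: generate words a small edit distance away and pick the most frequent known one. The sketch below shows that idea in miniature. `WORD_FREQ`, `edits1`, and `toy_correction` are hypothetical names for this illustration, the frequencies are made up, and only one edit is tried here, so this is not the library's actual code.

```python
import string

# Hypothetical word frequencies; the real library ships a much larger list.
WORD_FREQ = {"school": 50, "schools": 30, "arlington": 20, "park": 15, "parks": 10}

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def toy_correction(word):
    """Return the word if known, else the most frequent known word one edit away."""
    word = word.lower()
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else word

print(toy_correction("scool"), toy_correction("Arlingtn"), toy_correction("xyz"))
```

This also shows why local names such as "Arlington" had to be added to the word list first: a word that is not in the list can be "corrected" into a different, more common word.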

This code loops through each response and measures how similar it is to a given query. Using a problem believed to show up in the data as the query, the code compares the query's word vector to the vector of each response, and the responses with the closest vectors are printed. If no response matches the query word for word, responses whose words have similar vectors are still selected. For example, a response with the word "bus" could be a result for a query that uses the word "school".

In [None]:
query = "missing middle"
query_doc = nlp(query)  # process the query once instead of once per response

def similarityToQuery(text):
    return nlp(text).similarity(query_doc)

df_hispanic['similarity_to_query'] = df_hispanic['spell_checked_text'].apply(similarityToQuery)
pd.set_option('display.max_colwidth', None)

# Print the three responses most similar to the query
top_matches = df_hispanic.sort_values('similarity_to_query', ascending=False).head(3)
for text in top_matches['concatenated_text']:
    print(text)
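
By default, spaCy's `similarity` is a cosine similarity between vectors. The sketch below works through that calculation by hand. The three-number vectors in `toy_vectors` are made up for illustration; real word vectors have far more dimensions and come from the model.

```python
import math

# Hypothetical 3-dimensional vectors; real spaCy vectors have many more dimensions.
toy_vectors = {
    "school": [0.9, 0.1, 0.3],
    "bus": [0.7, 0.2, 0.4],
    "tree": [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank every word by how close its vector is to the vector for "school".
query_vector = toy_vectors["school"]
ranked = sorted(
    toy_vectors,
    key=lambda w: cosine_similarity(toy_vectors[w], query_vector),
    reverse=True,
)
print(ranked)
```

One caveat from spaCy's documentation: small models such as `en_core_web_sm` do not ship with word vectors, so a larger model such as `en_core_web_md` may give more meaningful similarity scores.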

## Summary

### This project started off with importing and cleaning an Excel file. Once that was done, I used charts and word clouds to illustrate what was most common among the answers and whether they were positive or negative. I also included a tool that works similarly to a search engine by pulling the most similar responses to any prompt. I specifically focused on sentiment analysis, which measures polarity and subjectivity. I believe it was very necessary for this project because without it we could only know what the responses were about, but not what people think about those topics, and that is very important to me. This is my first project using surveys and data science. I am happy with the results and look forward to more work with data and how to deal with it.