Description of the final assignment: 

In small groups, you will perform a full end-to-end geospatial and semantic analysis of social media data. This kind of task has many real-world applications, and every step you complete in this assignment can serve as a point of reference in later project work. 

- First, you will access CSV data and carry out <b>preprocessing</b> steps to prepare the data for geospatial and semantic analyses.
- You'll then <b>decide</b> which model from the Hugging Face platform is appropriate for your semantic analysis. 
- Using the <b>pipeline</b> approach, you'll generate a new column with your analysis results.
- After that, you will perform a <b>geospatial analysis</b> on the results of your <b>semantic analysis</b>. 

### Task 1: Data Preparation

Before any geospatial analysis of textual data, you first need to import the data and prepare it. These steps are very common, and you will likely have to repeat them whenever you run a similar analysis on your own or with different data. 

##### 0. Import Libraries and Packages

In [None]:
# Import the required packages here
#import tensorflow as tf
import torch
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig

import numpy as np
from scipy.special import softmax
import csv
import urllib.request
import pandas as pd                                                    # for data handling
import xml.etree.ElementTree as ET                                     # for parsing XML file (cElementTree was removed in Python 3.9)

import matplotlib.pyplot as plt
%matplotlib inline

import mapclassify                                                     # required for animated scatterplot map (plotly)
import geopandas as gpd                                                # geographic data handling
import folium                                                          # interactive mapping capabilities
import folium.plugins as plugins
import plpygis                                                         # a converter to and from the PostGIS geometry types, WKB, GeoJSON and Shapely formats
from plpygis import Geometry
from shapely.geometry import Point, Polygon, shape                     # creating geospatial data
from shapely import wkb, wkt                                           # creating and parsing geospatial data
import shapely                                                  

import plotly
import plotly.express as px                                            # for interactive, animated timeseries map
import seaborn as sns; sns.set(style="ticks", color_codes=True)
# import json

from wordcloud import WordCloud
from wordcloud import ImageColorGenerator
from wordcloud import STOPWORDS
from PIL import Image    # for masking wordcloud with an image
import requests          # for accessing url image
from io import BytesIO   # for accessing url image


##### 1. Get the data

There are several ways you can get data for analysis. Twitter data can be publicly accessed .... . In this case we will use a CSV file that already contains a collection of tweets. Download the data from https://drive.google.com/file/d/1hHkde7xSGNlemYRw6yFFj5RfRfAGbFHS/view?usp=share_link and add it to this repository.

##### 2. Load the data into a pandas DataFrame

Load the data using the pandas function ```pd.read_csv()```.

Hint: you may have to specify the encoding of the CSV file so that pandas can read the data properly.
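
To illustrate the encoding hint, here is a minimal sketch on made-up data. The two sample rows and the choice of Latin-1 are assumptions for demonstration only; the encoding of the actual tweet file may be different and has to be found by trial or inspection.

```python
import io
import pandas as pd

# Hypothetical sample: two tweet-like rows encoded as Latin-1 (not UTF-8)
raw_bytes = "id,text\n1,Café in Brooklyn\n2,Déjà vu on 5th Ave\n".encode("latin-1")

# Reading with the default UTF-8 decoder fails on the accented characters
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding lets pandas decode the bytes correctly
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(utf8_ok)
print(sample)
```

If you see a `UnicodeDecodeError` when loading the real file, trying a different value for the `encoding` argument is usually the first thing to check.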

In [None]:
# TO DO: Load the CSV data into a pandas dataframe called "data"


##### 3. Take a first look at the data

- Check how large the dataset is by printing out the number of rows in the dataset
- Take a look at what kind of columns the dataset includes by printing out the names of the columns 
- Check the data types by printing out the individual data types of each column
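
The three checks above can be sketched on a small made-up table. The DataFrame `toy` and its column names are hypothetical stand-ins; the real dataset will have different columns.

```python
import pandas as pd

# Hypothetical stand-in for the tweet data; the real column names will differ
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "text": ["hello", "world", "nyc"],
    "geom": ["POINT (-73.9 40.7)", "POINT (-74.0 40.6)", "POINT (-73.8 40.8)"],
})

n_rows = len(toy)                 # number of rows
column_names = list(toy.columns)  # names of the columns
column_types = toy.dtypes         # one data type per column

print(n_rows)
print(column_names)
print(column_types)
```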

In [None]:
# TO DO: Check how large the dataset is


In [None]:
# TO DO: Check what columns are included in the dataset 


In [None]:
# TO DO: Check the datatypes for each of the columns


##### 4. Create a geodataframe

This is a crucial part of the data preparation: we need to be able to work with the spatial characteristics of this data. As you will have seen in the data types, the "geom" column is an 'object' type. To do any geospatial analyses we need to convert this DataFrame into a GeoDataFrame. 

Let's first take a look at what the "geom" column looks like. You can use the ```.loc``` method to display the "geom" value of the first row, just to get an idea of the data.

In [None]:
# TO DO: Display the "geom" value in the first row of the dataset


This is a representation of a polygon using the <b>Well-Known Text (WKT)</b> format. WKT is a text-based representation of geometry objects, defined by the Open Geospatial Consortium (OGC).

There are two key steps involved in the conversion of this text-based representation to an actual geometry data type:
1. Convert the WKT strings into shapely geometry objects (such as ```'Polygon'```). This is done by calling the pandas ```.apply()``` method with shapely's ```wkt.loads``` function on the "geom" column, i.e. ```.apply(wkt.loads)```.
2. Use the ```GeoDataFrame``` constructor from the ```geopandas``` library to create a geodataframe. Make sure to specify the coordinate reference system as "EPSG:4326"
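
The two steps above can be sketched on made-up data as follows. The table `toy` and its two WKT strings are assumptions for illustration; only the overall pattern carries over to the assignment data.

```python
import pandas as pd
import geopandas as gpd
from shapely import wkt

# Hypothetical sample with one polygon and one point stored as WKT text
toy = pd.DataFrame({
    "text": ["tweet a", "tweet b"],
    "geom": [
        "POLYGON ((-74.0 40.7, -73.9 40.7, -73.9 40.8, -74.0 40.8, -74.0 40.7))",
        "POINT (-73.95 40.75)",
    ],
})

# Step 1: parse each WKT string into a shapely geometry object
toy["geometry"] = toy["geom"].apply(wkt.loads)

# Step 2: build the GeoDataFrame and declare the CRS as WGS 84
toy_gdf = gpd.GeoDataFrame(toy, geometry="geometry", crs="EPSG:4326")

print(toy_gdf.geom_type.tolist())
print(toy_gdf.crs)
```

Note that passing `crs` to the constructor only declares which coordinate system the coordinates are already in; it does not reproject them.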

In [None]:
# TO DO: 1. Create a new column in the dataframe called "geometry" by converting the 'geom' column to shapely Polygon objects 

# TO DO: 2. Create a geodataframe called "geodata" from the pandas "data" dataframe, make sure you pass the argument crs='EPSG:4326'


Let's check which geometry types we have. This code is prepared for you: 

In [None]:
# DONE FOR YOU: Check the geodatatypes in the dataset
# Delete the three ''' to use the code below

'''
print(geodata.geom_type.value_counts())
'''

<span style="color:red">Complete this markdown cell with your answers:</span>

1. Describe what the geometry types of your dataset are: 

2. Why do you think they have those geometry types (think about how the data might be collected from Twitter)?

3. Why might point geometries be more useful compared to polygon geometries when it comes to spatial analyses? 

Let's make sure we only have points. One way of doing this is by turning all polygons into points, where the new point is the polygon's center. 

To do that we first need to identify all rows that are polygons. This can be done by creating a new column called ```"geom_type"``` in our "geodata" geodataframe. We can create the column by applying a ```lambda``` function to each row, where the function just checks the ```geom_type``` of each row based on the ```"geometry"``` column. This bit of code is prepared for you:

In [None]:
# DONE FOR YOU: Create a new column called "geom_type"
# Delete the three ''' to use the code below

'''
geodata['geom_type'] = geodata['geometry'].apply(lambda x: x.geom_type)
'''

Next we can iterate over each row in the ```geodata``` DataFrame to check if each row represents a polygon geometry, and if so, replace the polygon geometry with a point geometry representing the centroid of the polygon. This bit of code is prepared for you as well:

In [None]:
# DONE FOR YOU: Iterate over every row in the dataframe to replace all polygon geometries with point geometries
# Delete the three ''' to use the code below

'''
# Iterate over each row in the geodataframe
for index, row in geodata.iterrows():
    # Check if the row is a Polygon
    if row['geom_type'] == 'Polygon':
        # Get the centroid of the polygon
        centroid = row['geometry'].centroid
        # Create a new Point geometry from the centroid
        point_geom = Point(centroid.x, centroid.y)
        # Update the geometry of the row with the new Point geometry
        geodata.at[index, 'geometry'] = point_geom
        # Update the geom_type of the row to 'Point'
        geodata.at[index, 'geom_type'] = 'Point'
'''
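
As a side note, the same replacement can be written without an explicit loop. The following is only a sketch on made-up geometries (the frame `toy_gdf` is hypothetical), not a required part of the assignment; it applies one function to the geometry column instead of iterating with `iterrows()`.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical sample: one square polygon and one point
toy_gdf = gpd.GeoDataFrame(
    {"text": ["tweet a", "tweet b"]},
    geometry=[
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        Point(5, 5),
    ],
    crs="EPSG:4326",
)

# Replace polygons by their centroid; keep all other geometries unchanged
toy_gdf["geometry"] = toy_gdf["geometry"].apply(
    lambda geom: geom.centroid if geom.geom_type == "Polygon" else geom
)

# Record the geometry type of every row after the conversion
toy_gdf["geom_type"] = toy_gdf["geometry"].apply(lambda geom: geom.geom_type)
print(toy_gdf)
```

On large datasets this form is typically shorter to read and avoids the per-row overhead of `iterrows()`.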

Now let's just make sure we have successfully converted all our geometries into points:

In [None]:
# DONE FOR YOU: Check the geometry types again to make sure we successfully converted all polygons
# Delete the three ''' to use the code below

'''
print(geodata.geom_type.value_counts())
'''

##### 5. Filter the data by relevant columns

In step 3, when you checked the columns in this dataset, you will have noticed that there are quite a few of them. It's unlikely that you'll need all of them for your analysis. Therefore, take a moment to create a new DataFrame from the current one that contains only the relevant columns. Which columns are relevant is left up to you, but make sure you keep at least the columns that contain the actual tweets and the geometry! 

In [None]:
# TO DO: Create a new dataframe called "new_geodata" with only the relevant columns


##### 6. Filter the data by location

<img src="https://github.com/Christina1281995/demo-repo/blob/main/nyc.PNG?raw=true" align="right">

Now that we have a geodataframe, we are almost ready to go. The last preparation step is to <b>filter</b> our data so that only the records we are interested in remain. 

Specifically, we are interested only in a small area: <b>New York City</b>. 

To filter the data by a certain area, we need a shapefile of that area.  Take a look <u>[here](https://geodata.lib.utexas.edu/catalog/nyu-2451-34490)</u> at the shapefile we will be using.

Download the shapefile from the link provided above. Store <u>all</u> of the files you have downloaded in a new folder in this repository. 
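
To make the idea of filtering by area concrete, here is a minimal sketch with made-up data. The rectangle in `area` is a hypothetical stand-in for the city boundary and the two points are invented; the sketch uses a simple `within` test, which is one of several possible ways to do this.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical study area: a simple rectangle standing in for the city boundary
area = gpd.GeoDataFrame(
    {"name": ["study area"]},
    geometry=[Polygon([(-74.1, 40.6), (-73.8, 40.6), (-73.8, 40.9), (-74.1, 40.9)])],
    crs="EPSG:4326",
)

# Hypothetical tweet locations: one inside the rectangle, one far outside
points = gpd.GeoDataFrame(
    {"text": ["inside", "outside"]},
    geometry=[Point(-73.95, 40.75), Point(-70.0, 42.0)],
    crs="EPSG:4326",
)

# Both layers must share a CRS before any spatial comparison
same_crs = points.crs == area.crs

# Keep only the points that fall within the boundary polygon
boundary = area.geometry.iloc[0]
inside = points[points.within(boundary)]
print(same_crs)
print(inside["text"].tolist())
```

A boundary file with several features would first need to be merged into a single geometry, or handled with an overlay or spatial join instead.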

In [None]:
# ------------------------------- SOLUTION -------------------------------

# Read in the boundary of New York City
nyc = gpd.read_file("new york/nyc.shp")

# Reproject nyc to the common CRS
nyc = nyc.to_crs("EPSG:4326")

In [None]:
# ------------------------------- SOLUTION -------------------------------

# Filter your geodataframe by the boundary of New York City
filtered_gdf = gpd.overlay(new_geodata, nyc, how='intersection')

print(len(filtered_gdf))

In [None]:
# ------------------------------- SOLUTION -------------------------------

# View the filtered geodataframe
filtered_gdf.head()

Let's check to make sure that the points in our filtered dataset are indeed in New York! We can do this using the [plotly scatter_mapbox tools](https://plotly.github.io/plotly.py-docs/generated/plotly.express.scatter_mapbox.html). Plotly maps require individual columns for latitude and longitude in order to plot the points on a map. So we'll first have to create those two columns.

In [None]:
# Define new columns for latitude and longitude
filtered_gdf['lat'] = filtered_gdf['geometry'].apply(lambda x : x.y if x else np.nan)
filtered_gdf['lon'] = filtered_gdf['geometry'].apply(lambda x : x.x if x else np.nan)

In [None]:
# plotly mapbox
fig = px.scatter_mapbox(filtered_gdf,                       # the dataset
                        lat="lat",                          # the column in the dataset indicating the latitude
                        lon="lon",                          # the column in the dataset indicating the longitude
                        color="bcode",                      # the column in the dataset by which the points should be colored
                        center=dict(                        # the coordinates that the map should center on
                                    lat=40.7,
                                    lon=-73.9
                                ),
                        zoom=9,                             # the initial zoom-level of the map (higher numbers = more zoomed in)
                        mapbox_style='carto-positron',      # the style of the base map 
                        height=600                          # height of the map figure in pixels
                        )      
fig.show()

In [None]:
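# Create a choropleth map using a GeoDataFrame
# Note: a choropleth is designed for polygon features, so point geometries may not be drawn
fig = px.choropleth_mapbox(
    filtered_gdf,  # GeoDataFrame
    geojson=filtered_gdf.geometry,  # Use the "geometry" column for mapping
    locations=filtered_gdf.index,   # Match each row to its geometry by index
    mapbox_style="open-street-map",
    color="date",
    zoom=3,
    center={"lat": 40, "lon": -73},  # Set the initial center of the map
)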

# Show the map
fig.show()

### Task 2: Choose an appropriate NLP Model for your Analysis

##### 1. Information

Now is the time to think about your semantic analysis. 

First you may want to ask yourselves <i>"What questions <b>can</b> we even answer using NLP?"</i> and <i>"What questions would be <b>useful and interesting</b> to answer?"</i> 

You may browse the available models on the Hugging Face platform (focus on the models under the category "Natural Language Processing" and "Text Classification"). Most of the models come with some documentation and a description. Decide together which model you will use. In your submission, add a markdown cell explaining your reasons for choosing the model.

##### 2. Your choice of NLP model

<span style="color:red">Complete this markdown cell with your answers:</span>

1. Model chosen:

2. Discussion on why this particular model was chosen: 


##### 3. Import the model

In [None]:
# TO DO 

##### 4. Run a semantic analysis on your data

In [None]:
# TO DO
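
As a rough sketch of how classifier results can end up in new columns: a Hugging Face text-classification pipeline, called on a list of strings, returns one dictionary with a `label` and a `score` per input. The helper `add_predictions` and the `stub_classifier` below are hypothetical names introduced for illustration; the stub only mimics that output format so the example runs without downloading a model.

```python
import pandas as pd

def add_predictions(df, classifier, text_column="text"):
    """Return a copy of df with 'label' and 'score' columns from a classifier."""
    results = classifier(df[text_column].tolist())
    out = df.copy()
    out["label"] = [r["label"] for r in results]
    out["score"] = [r["score"] for r in results]
    return out

def stub_classifier(texts):
    """Hypothetical stand-in that mimics the pipeline's output format."""
    return [
        {"label": "POSITIVE" if "love" in t.lower() else "NEGATIVE", "score": 0.9}
        for t in texts
    ]

toy = pd.DataFrame({"text": ["I love this park", "Stuck on the subway again"]})
labelled = add_predictions(toy, stub_classifier)
print(labelled)

# With a real model, the stub would be replaced by something like:
# from transformers import pipeline
# classifier = pipeline("text-classification", model="<your chosen model>")
# labelled = add_predictions(toy, classifier)
```

The labels a real model returns depend entirely on the model you choose, so check its documentation before interpreting the new column.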

### Task 4: Plot the Analysis Results on a Map

### Task 5: Interpret the Results