##### 0. Imports

In [1]:
# Import the required packages here
#import tensorflow as tf
import torch
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig

import numpy as np
from scipy.special import softmax
import csv
import urllib.request
import pandas as pd                                                    # for data handling
import xml.etree.ElementTree as ET                                     # for parsing XML file (cElementTree was removed in Python 3.9)

import matplotlib.pyplot as plt
%matplotlib inline

import mapclassify                                                     # required for animated scatterplot map (plotly)
import geopandas as gpd                                                # geographic data handling
import folium                                                          # interactive mapping capabilities
import folium.plugins as plugins
import plpygis                                                         # a converter to and from the PostGIS geometry types, WKB, GeoJSON and Shapely formats
from plpygis import Geometry
from shapely.geometry import Point, Polygon, shape                     # creating geospatial data
from shapely import wkb, wkt                                           # creating and parsing geospatial data
import shapely                                                  

import plotly
import plotly.express as px                                            # for interactive, animated timeseries map
import seaborn as sns; sns.set(style="ticks", color_codes=True)
# import json

from wordcloud import WordCloud
from wordcloud import ImageColorGenerator
from wordcloud import STOPWORDS
from PIL import Image    # for masking wordcloud with an image
import requests          # for accessing url image
from io import BytesIO   # for accessing url image




### Task 1: Data Preparation

Before you can run any geospatial analysis of textual data, you need to import the data and prepare it. These steps are very common, and you will likely repeat them whenever you do a similar analysis on your own or with different data.

##### 1. Get the data

There are several ways you can get data for analysis. Twitter data can be publicly accessed .... . In this case we will use a CSV file that already contains a collection of tweets. Download the data from https://drive.google.com/file/d/1hHkde7xSGNlemYRw6yFFj5RfRfAGbFHS/view?usp=share_link and add it to this repository.

##### 2. Load the data into a pandas DataFrame

In [None]:
# Load the data using the pandas function .read_csv()
# hint: you may have to specify the correct encoding of csv so that the pandas library can read the data properly


In [14]:
# ------------------------------- SOLUTION -------------------------------
# Load the data
data = pd.read_csv('twittertakeover.csv', encoding='latin1')
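
The `encoding` argument matters because pandas decodes CSV files as UTF-8 unless told otherwise. The sketch below uses a tiny made-up CSV held in memory (not the tweet file) to show the kind of error you may see without the argument, and how specifying Latin-1 resolves it. The sample text and variable names are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical one-row CSV, encoded as Latin-1 (not taken from the tweet dataset)
raw = "text\ncafé in NYC\n".encode("latin1")

# Without an explicit encoding, pandas tries UTF-8 and fails on the byte for "é"
try:
    pd.read_csv(io.BytesIO(raw))
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With the matching encoding, the text is read correctly
sample = pd.read_csv(io.BytesIO(raw), encoding="latin1")
print(utf8_ok, sample.loc[0, "text"])
```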

##### 3. Take a first look at the data

- Check how large the dataset is
- Take a look at what kind of columns we have
- Check the data types

In [None]:
# Check how large the dataset is: print out how many rows there are


In [None]:
# Print the names of the columns

In [None]:
# Check the datatypes for each of the columns

In [15]:
# ------------------------------- SOLUTION -------------------------------
# Check how large the dataset is: print out how many rows there are
print(len(data))

97662


In [16]:
# ------------------------------- SOLUTION -------------------------------
# Print the names of the columns
data.columns

Index(['message_id', 'date', 'text', 'tags', 'tweet_lang', 'source', 'place',
       'geom', 'retweets', 'tweet_favorites', 'photo_url', 'quoted_status_id',
       'user_id', 'user_name', 'user_location', 'followers', 'friends',
       'user_favorites', 'status', 'user_lang', 'latitude', 'longitude',
       'data_source', 'keywords', 'sentiment', 'county_id', 'tweet_integer',
       'tweet_positive', 'tweet_negative'],
      dtype='object')

In [17]:
# ------------------------------- SOLUTION -------------------------------
# Check the datatypes for each of the columns
data.dtypes

message_id          float64
date                 object
text                 object
tags                 object
tweet_lang           object
source               object
place                object
geom                 object
retweets              int64
tweet_favorites       int64
photo_url            object
quoted_status_id    float64
user_id             float64
user_name            object
user_location        object
followers             int64
friends               int64
user_favorites        int64
status                int64
user_lang           float64
latitude            float64
longitude           float64
data_source          object
keywords               bool
sentiment            object
county_id             int64
tweet_integer         int64
tweet_positive      float64
tweet_negative      float64
dtype: object
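
Note that the "date" column is stored as `object`, meaning plain strings. If you want to sort or group tweets by time later on, it usually helps to convert it to a datetime type. The following is a minimal sketch on a toy frame, assuming the column follows the day/month/year pattern seen in the data preview further down (for example `16/10/2022 02:46`); check that assumption against the full dataset before relying on it.

```python
import pandas as pd

# Toy frame mimicking the assumed day/month/year format of the "date" column
toy = pd.DataFrame({"date": ["16/10/2022 02:46", "17/11/2022 05:06"]})

# An explicit format avoids day/month ambiguity (e.g. 05/06 vs 06/05)
toy["date"] = pd.to_datetime(toy["date"], format="%d/%m/%Y %H:%M")

print(toy.dtypes)
```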

##### 4. Create a geodataframe

This is a crucial part of the data preparation: we need to be able to work with the spatial characteristics of this data. As you will have seen in the data types, the "geom" column is an 'object' type. To do any geospatial analyses we need to convert this DataFrame into a GeoDataFrame. 

Let's first take a look at what the "geom" column looks like. Here is the "geom" value of the first row in our dataset:

In [18]:
# Take a look at the "geom" values 
data.loc[0, 'geom']

'POLYGON ((-121.4168716 37.883347, -121.183979 37.883347, -121.183979 38.078305, -121.4168716 38.078305, -121.4168716 37.883347))'

This is a representation of a polygon using the <b>Well-Known Text (WKT)</b> format. WKT is a text-based representation of geometry objects, defined by the Open Geospatial Consortium (OGC).
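
To get a feel for what parsing WKT means, here is a small sketch that converts the string shown above into a shapely geometry and inspects it. It uses only shapely's `wkt.loads`; the variable names are illustrative.

```python
from shapely import wkt

# The WKT string printed above for the first row
text = (
    "POLYGON ((-121.4168716 37.883347, -121.183979 37.883347, "
    "-121.183979 38.078305, -121.4168716 38.078305, -121.4168716 37.883347))"
)

# Parse the text into a shapely geometry object
polygon = wkt.loads(text)

print(polygon.geom_type)   # the kind of geometry that was parsed
print(polygon.bounds)      # (min_x, min_y, max_x, max_y)
print(polygon.centroid)    # the geometric centre, itself a Point
```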

There are two key steps involved in the conversion of this text-based representation to an actual geometry data type:
1. Convert the WKT strings into shapely geometry objects (```'Polygon'``` or ```'Point'```). This is done by passing shapely's ```wkt.loads``` function to the pandas ```.apply()``` method on the "geom" column.
2. Use the ```GeoDataFrame``` constructor from the ```geopandas``` library to create a geodataframe. Make sure to specify the coordinate reference system as "EPSG:4326".

In [None]:
# Create a new column in the dataframe called "geometry" by converting the 'geom' column to shapely Polygon objects 
data['geometry'] = 

# create a geodataframe from the pandas dataframe
geodata = 

In [19]:
# ------------------------------- SOLUTION -------------------------------

# convert the 'geom' column to shapely Polygon objects
data['geometry'] = data['geom'].apply(wkt.loads)

# create a geodataframe from the pandas dataframe
geodata = gpd.GeoDataFrame(data, crs='EPSG:4326')

Let's check what kinds of geometry types we have:

In [24]:
# ------------------------------- SOLUTION -------------------------------
print(geodata.geom_type.value_counts())

Polygon    96509
Point       1153
Name: count, dtype: int64


To make the data easier to work with, we want every geometry to be a point. One way of doing this is to replace each polygon with its centroid.

In [26]:
# ------------------------------- SOLUTION -------------------------------

# Create a new column called "geom_type"
geodata['geom_type'] = geodata['geometry'].apply(lambda x: x.geom_type)

# Iterate over each row in the geodataframe
for index, row in geodata.iterrows():
    # Check if the row is a Polygon
    if row['geom_type'] == 'Polygon':
        # Get the centroid of the polygon
        centroid = row['geometry'].centroid
        # Create a new Point geometry from the centroid
        point_geom = Point(centroid.x, centroid.y)
        # Update the geometry of the row with the new Point geometry
        geodata.at[index, 'geometry'] = point_geom
        # Update the geom_type of the row to 'Point'
        geodata.at[index, 'geom_type'] = 'Point'
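
Looping with `iterrows()` works but can be slow on roughly 100,000 rows. As an alternative, geopandas can compute all centroids in one vectorised call, and the centroid of a point is the point itself, so no type check is needed. The sketch below shows this on a hypothetical two-row GeoDataFrame rather than on the tweet data. Note that geopandas may warn that centroids computed in a geographic CRS such as EPSG:4326 are approximate, which is usually acceptable for small bounding boxes like these.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical toy data: one square polygon and one point
toy = gpd.GeoDataFrame(
    {"name": ["box", "dot"]},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]), Point(5, 5)],
    crs="EPSG:4326",
)

# Replace every geometry with its centroid in a single vectorised step
toy["geometry"] = toy.geometry.centroid

print(toy.geom_type.value_counts())
```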


Now let's just make sure we have successfully converted all our geometries into points:

In [27]:
# ------------------------------- SOLUTION -------------------------------
print(geodata.geom_type.value_counts())

Point    97662
Name: count, dtype: int64


##### 5. Filter the data by relevant columns

##### 6. Filter the data by location

<img src="https://github.com/Christina1281995/demo-repo/blob/main/nyc.PNG?raw=true" align="right">

Now that we have a geodataframe, we are almost ready to go. The last preparation step is to <b>filter</b> the data so that only the tweets relevant to our analysis remain.

Specifically, we are interested only in a small area: <b>New York City</b>. 

To filter the data by a certain area, we need a shapefile of that area. Take a look <u>[here](https://geodata.lib.utexas.edu/catalog/nyu-2451-34490)</u> at the shapefile we will be using.

Download the shapefile from the link provided above. A shapefile consists of several companion files that must stay together, so store <u>all</u> of the files you have downloaded in a new folder in this repository.

In [34]:
# ------------------------------- SOLUTION -------------------------------

# Read in the boundary of New York City
nyc = gpd.read_file("new york/nyc.shp")

# Reproject nyc to the common CRS
nyc = nyc.to_crs("EPSG:4326")

In [35]:
# ------------------------------- SOLUTION -------------------------------

# Filter your geodataframe by the boundary of New York City
filtered_gdf = gpd.overlay(geodata, nyc, how='intersection')

print(len(filtered_gdf))

1235
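
`gpd.overlay` also attaches the boundary's attribute columns to every tweet. If you only need to know which points fall inside an area, a point-in-polygon test is a lighter alternative. The following sketch illustrates the idea on hypothetical toy geometries (a square study area and two points), not on the New York City data.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical study area: a 10 x 10 square
area = gpd.GeoDataFrame(
    geometry=[Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])],
    crs="EPSG:4326",
)

# Hypothetical points: one inside the square, one outside
points = gpd.GeoDataFrame(
    {"label": ["inside", "outside"]},
    geometry=[Point(5, 5), Point(20, 20)],
    crs="EPSG:4326",
)

# Take the single boundary polygon and keep only the points within it
boundary = area.geometry.iloc[0]
inside = points[points.within(boundary)]

print(inside)
```

Both layers must share the same CRS for the test to be meaningful. If the boundary layer contains several polygons, you would first need to merge them into one geometry.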


In [36]:
# ------------------------------- SOLUTION -------------------------------

# View the filtered geodataframe
filtered_gdf.head()

Unnamed: 0,message_id,date,text,tags,tweet_lang,source,place,geom,retweets,tweet_favorites,...,county_id,tweet_integer,tweet_positive,tweet_negative,geom_type,bcode,bname,name,namelsad,geometry
0,1.58148e+18,16/10/2022 02:46,Do you remember when you joined Twitter? I do!...,MyTwitterAnniversary,en,"<a href=""http://twitter.com/download/android"" ...",Metropolitan Hospital Center,"POLYGON ((-73.944941 40.784863, -73.944941 40....",0,0,...,402,1,1.0,,Point,36061,Manhattan,New York,New York County,POINT (-73.94494 40.78486)
1,1.59311e+18,17/11/2022 05:06,Thank you @oxbakerjr #OxBakerJr for follow me ...,OxBakerJr,en,"<a href=""http://instagram.com"" rel=""nofollow"">...","Manhattan, NY",POINT (-73.946794 40.809148),0,0,...,402,1,1.0,,Point,36061,Manhattan,New York,New York County,POINT (-73.94679 40.80915)
2,1.58059e+18,13/10/2022 15:50,Loved celebrating #sonographerappreciationmont...,"sonographerappreciationmonth,MedicalUltrasound...",en,"<a href=""http://twitter.com/download/iphone"" r...",Alison,"POLYGON ((-73.947016 40.790971, -73.947016 40....",0,0,...,402,1,1.0,,Point,36061,Manhattan,New York,New York County,POINT (-73.94702 40.79097)
3,1.59363e+18,18/11/2022 15:40,Well Twitter is still here & so are we! Its g...,,en,"<a href=""http://twitter.com/download/iphone"" r...","Upper East Side, Manhattan","POLYGON ((-73.97299700000001 40.758656, -73.97...",0,0,...,402,1,1.0,,Point,36061,Manhattan,New York,New York County,POINT (-73.96056 40.77043)
4,1.58538e+18,26/10/2022 21:12,A special field trip to the HF office to wish ...,CardioTwitter,en,"<a href=""http://twitter.com/download/iphone"" r...",The Mount Sinai Hospital,"POLYGON ((-73.952957 40.79011, -73.952957 40.7...",0,0,...,402,1,1.0,,Point,36061,Manhattan,New York,New York County,POINT (-73.95296 40.79011)


### Task 2: Exploratory Data Analysis

### Task 3: Choose an appropriate NLP Model for your Analysis

### Task 4: Plot the Analysis Results on a Map

### Task 5: Interpret the Results