### Capturing Streaming Data and Automating Exploratory Data Analysis

#### Author: Chris Schmidt

#### Abstract

This project was part of an employment assessment that provided access to a streaming API carrying RSVP data for events organized by meetup.com groups. The commented-out code block captures the first 10,000 lines of JSONL from the meetup RSVP streaming API after initiation and saves them in a file named *sample_rsvp.jsonl*, which can be directed to the user's file path of choice. A sample_rsvp.jsonl file containing data captured by running this block of code is attached.

The approach normalizes the provided sample.jsonl file into a master pandas dataframe, then splits that dataframe into two dataframes, RSVP and EVENTS, with the selected fields in each. The pandas-profiling module was then used to automate the EDA (exploratory data analysis) process on each of the dataframes and detail the findings.

#### Summary

For the most part the data is not amenable to traditional statistical analysis, so summary statistics and an evaluation along those lines are elusive. Additional work is possible, such as using the latitude and longitude information for geolocation and mapping the languages to the respective coordinates to look for anomalies in that space.

The pandas-profiling module is one of several available automated EDA tools for Python and it allows the creation of reports that open in Jupyter and reports that are generated in HTML that can be shared with others. 

The highlights of each table are listed below. The detailed reports and additional perspective on the output are found in the report sections below.

The RSVP table report identifies several areas to highlight that may be an issue.
- There are 565 (0.8%) duplicate rows matched in rsvp_id. 
- The guests column has 64,500 zeros (95.3%).

The EVENTS table report identifies several areas to highlight that may be an issue.
- Missing 2.8% of the cells overall. 
- Missing 8.4% of the venue name and location columns, lat. and lon.
- Duplicates of character strings identified in the event ID column.
- The columns event_name, group_name, and group_urlname are populated with combinations of mixed case, all capital letters, all lower case letters, and occasional emojis in the same cells. The emojis in the data may be a reason for a flag on this issue.
- There are 11,906 duplicate rows that need further investigation to ascertain if there is a rational reason for the duplication.


#### Goals

1. Develop a code block that connects to the API to capture 10,000 lines of the stream in a file named sample.jsonl before terminating.
2. The captured data should be sent to two tables: The first named RSVP and the second named EVENTS with the tables updating as information comes into only this streaming endpoint.
3. The RSVP table will have the following fields captured:

    - rsvp_id,
    - mtime,
    - response,
    - guest_count,
    - event_id.
 
4. The EVENTS table will have the following fields captured:

    - event_id,
    - event_name,
    - event_time, 
    - venue_name,
    - venue_lat,
    - venue_lon,
    - group_country,
    - group_name,
    - group_urlname. 

5. Perform exploratory data analysis on the provided file, named sample.jsonl, consisting of previously acquired meetup RSVP streaming data.
    - Show detailed level analysis of the data.
    - Identify issues that will need to be dealt with in the data.
6. Write up summary of findings
    - This is done in the Summary found above. 
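
Goal 2 asks for tables that update as records arrive, while the notebook itself works in batch on the sample file. As a minimal sketch of the per-record alternative, the function below splits one parsed RSVP record into an RSVP row and an EVENTS row using the field names from goals 3 and 4. The function name `split_record` and the sample record are hypothetical, and the record layout is assumed to match the nested structure shown later in this notebook.

```python
from typing import Any, Dict, Tuple


def split_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split one parsed RSVP record into (rsvp_row, event_row) dictionaries."""
    # Nested objects may be absent (for example, RSVPs without a venue).
    event = record.get("event") or {}
    venue = record.get("venue") or {}
    group = record.get("group") or {}

    rsvp_row = {
        "rsvp_id": record.get("rsvp_id"),
        "mtime": record.get("mtime"),
        "response": record.get("response"),
        "guest_count": record.get("guests"),
        "event_id": event.get("event_id"),
    }
    event_row = {
        "event_id": event.get("event_id"),
        "event_name": event.get("event_name"),
        "event_time": event.get("time"),
        "venue_name": venue.get("venue_name"),
        "venue_lat": venue.get("lat"),
        "venue_lon": venue.get("lon"),
        "group_country": group.get("group_country"),
        "group_name": group.get("group_name"),
        "group_urlname": group.get("group_urlname"),
    }
    return rsvp_row, event_row


# Hypothetical record shaped like one line of the stream (no venue present).
sample_record = {
    "rsvp_id": 1872032843,
    "mtime": 1620498339708,
    "response": "no",
    "guests": 0,
    "event": {"event_id": "277931938", "event_name": "Example hike"},
    "group": {"group_country": "au", "group_name": "Adelaide Walkers & Joggers"},
}

rsvp_rows, event_rows = [], []
rsvp_row, event_row = split_record(sample_record)
rsvp_rows.append(rsvp_row)    # rows can be appended as each record arrives
event_rows.append(event_row)
```

Accumulating rows in lists and building a dataframe at the end is usually cheaper than appending to a dataframe one row at a time.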

Use pip install for requests, flatten_json, and pandas_profiling

In [1]:
#pip install requests

In [2]:
#pip install flatten_json

In [3]:
# pip install pandas-profiling

Import the following modules and packages:

In [35]:
import numpy as np
import pandas as pd
import json
import requests
from pandas.io.json import json_normalize
from pandas_profiling import ProfileReport
from flatten_json import flatten


#### Code to capture 10,000 lines of streaming data from the meetup API below

This is the code to capture streaming data from the meetup API and cut off at 10,000 lines. 

In [45]:
# Code for ingesting 10,000 lines of streaming data from the meetup API, terminating at that count.
# This takes a VERY long time to run on 10,000 lines - PRO TIP: Don't even try! If a line of the
# stream cannot be decoded or written as UTF-8, the try block catches it and prints "bad unicode".

#index = 0
#file = open('sample_rsvp.jsonl', 'w', encoding='utf-8')
#request = requests.get('http://stream.meetup.com/2/rsvps', stream=True)
#for raw_rsvp in request.iter_lines():
#    if raw_rsvp:
#        try:
#            file.write(raw_rsvp.decode('utf-8'))
#            file.write("\n")
#        except UnicodeError:
#            print("bad unicode")
#    index = index + 1
#    if index >= 10000:
#        break
#file.close()




bad unicode
bad unicode
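
The capture loop above can also be written so that it is testable without a network connection. The sketch below is an assumption-laden variant, not the code that produced the attached file: `capture_lines` is a hypothetical helper that accepts any iterable of byte strings (such as the result of `iter_lines()` on a streamed `requests` response), counts only the lines it actually writes, and skips lines that fail to decode.

```python
import io
from typing import Iterable, TextIO


def capture_lines(lines: Iterable[bytes], out: TextIO, limit: int = 10_000) -> int:
    """Write up to `limit` non-empty, UTF-8 decodable lines to `out`.

    Returns the number of lines written.
    """
    written = 0
    for raw in lines:
        if written >= limit:
            break
        if not raw:  # skip empty keep-alive lines
            continue
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            print("bad unicode")
            continue
        out.write(text + "\n")
        written += 1
    return written


# Offline demonstration with made-up bytes standing in for the stream.
fake_stream = [b'{"rsvp_id": 1}', b"", b"\xff\xfe", b'{"rsvp_id": 2}', b'{"rsvp_id": 3}']
buffer = io.StringIO()
count = capture_lines(fake_stream, buffer, limit=2)

# Against the live endpoint, the same helper could be called as:
#   with requests.get(url, stream=True) as r, open(path, "w", encoding="utf-8") as f:
#       capture_lines(r.iter_lines(), f)
```

Opening the output file with an explicit `encoding="utf-8"` matters on Windows, where the default encoding may not be able to represent emoji in event names.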


In [48]:
# Stream capture example on only 10 lines to illustrate that capture block does work. 

#StreamDf = pd.json_normalize(a)
#StreamDf.head()

Unnamed: 0,visibility,response,guests,rsvp_id,mtime,venue.venue_name,venue.lon,venue.lat,venue.venue_id,member.member_id,...,group.group_id,group.group_name,group.group_lon,group.group_urlname,group.group_lat,group.group_state,member.other_services.facebook.identifier,member.other_services.twitter.identifier,member.other_services.flickr.identifier,member.other_services.tumblr.identifier
0,public,yes,0,1872717810,1620498334477,Victoria Park,-0.038972,51.53656,26148882.0,261343705,...,32511274,London Philosophy Collective,-0.1,London-Philosophy-Collective,51.52,,,,,
1,public,no,0,1872032843,1620498339708,,,,,199239918,...,2689942,Adelaide Walkers & Joggers,138.6,Adelaide-Walkers-Joggers,-34.93,,,,,
2,public,yes,0,1872717815,1620498341704,Bluemont Park Picnic Pavilion,-77.130684,38.869236,26308501.0,223227249,...,27673469,Smiley Social,-77.1,Smiley-Social,38.89,VA,,,,
3,public,no,0,1870719986,1620498341802,General Aviation Center (GAC),8.57379,47.45337,27074423.0,312439269,...,35015193,Fly Zürich,8.54,fly-zurich,47.38,,,,,
4,public,yes,0,1872717816,1620498342950,Online event,179.1962,-8.521147,26906060.0,331379582,...,24897058,Spiritual Experiences Group of Steamboat Springs,-106.85,Spiritual-Experiences-Group-of-Steamboat-Springs,40.48,CO,,,,


####  Open the sample.jsonl file

In [6]:
# Open the sample.jsonl file
file = open(r'C:\Users\pchri\Chris Schmidt\sample.jsonl')

#### View the sample.jsonl file in a pandas dataframe

Viewing the first 10 rows and the last 10 rows of the dataframe, we can see that the JSONL format needs to be normalized before creating the tables we are interested in.

In [7]:
# Read sample.jsonl into a dataframe, one JSON record per line
df = pd.read_json(r'C:\Users\pchri\Chris Schmidt\sample.jsonl', lines=True)

In [8]:
df.head(10)

Unnamed: 0,venue,visibility,response,guests,member,rsvp_id,mtime,event,group
0,"{'venue_name': 'Victoria Park', 'lon': -0.0389...",public,yes,0,"{'member_id': 261343705, 'photo': 'https://sec...",1872717810,1620498334477,"{'event_name': 'Philosophy in the Park 013', '...",{'group_topics': [{'urlkey': 'critical-thinkin...
1,,public,no,0,"{'member_id': 199239918, 'photo': 'https://sec...",1872032843,1620498339708,{'event_name': ' Wine Shanty Altitude Hike. 8:...,"{'group_topics': [{'urlkey': 'wellness', 'topi..."
2,{'venue_name': 'Bluemont Park Picnic Pavilion'...,public,yes,0,"{'member_id': 223227249, 'photo': 'https://sec...",1872717815,1620498341704,{'event_name': 'Volleyball 101 - Absolute Begi...,"{'group_topics': [{'urlkey': 'social', 'topic_..."
3,{'venue_name': 'General Aviation Center (GAC)'...,public,no,0,"{'member_id': 312439269, 'photo': 'https://sec...",1870719986,1620498341802,"{'event_name': 'Adventure Flight ', 'event_id'...",{'group_topics': [{'urlkey': 'flying-experienc...
4,"{'venue_name': 'Online event', 'lon': 179.1962...",public,yes,0,"{'member_id': 331379582, 'photo': 'https://sec...",1872717816,1620498342950,"{'event_name': 'Experience HU, The Sound of So...","{'group_topics': [{'urlkey': 'eckankar', 'topi..."
5,"{'venue_name': 'Online event', 'lon': 179.1962...",public,yes,0,"{'member_id': 63075752, 'photo': 'https://secu...",1872717817,1620498343772,{'event_name': '🎵3️⃣ 🎉 Three Year Onlineversor...,"{'group_topics': [{'urlkey': 'conversation', '..."
6,{'venue_name': 'La Pecera del Círculo de Bella...,public,yes,0,"{'member_id': 277023324, 'photo': 'https://sec...",1872717818,1620498343992,{'event_name': 'SOCIAL & LANGUAGE EXCHANGE MEE...,"{'group_topics': [{'urlkey': 'language', 'topi..."
7,"{'venue_name': '635 Middle Country Rd', 'lon':...",public,yes,1,"{'member_id': 329194397, 'photo': 'https://sec...",1872717819,1620498345786,{'event_name': 'THURSDAY PICKUP COED SOCCER (C...,{'group_topics': [{'urlkey': 'pick-up-soccer-s...
8,"{'venue_name': 'Compton', 'lon': -1.336569, 'l...",public,yes,0,"{'member_id': 261515230, 'photo': 'https://sec...",1872717820,1620498346666,{'event_name': 'Compton - Oliver's Battery Cir...,"{'group_topics': [{'urlkey': 'hiking', 'topic_..."
9,"{'venue_name': 'Georgetown Art Center', 'lon':...",public,yes,0,"{'member_id': 331167042, 'photo': 'https://sec...",1872717823,1620498347996,{'event_name': 'Georgetown Art Center-Artist R...,"{'group_topics': [{'urlkey': 'art', 'topic_nam..."


In [9]:
df.tail(10)

Unnamed: 0,venue,visibility,response,guests,member,rsvp_id,mtime,event,group
67680,"{'venue_name': 'New Brighton Club', 'lon': 172...",public,yes,0,"{'member_id': 256608111, 'member_name': 'Graha...",1872796492,1620581195643,{'event_name': 'Mama Rock at New Brighton Club...,"{'group_topics': [{'urlkey': 'pubs-bars', 'top..."
67681,"{'venue_name': 'Kowloon Park Sports Centre', '...",public,yes,0,"{'member_id': 2804022, 'photo': 'https://secur...",1872601861,1620581196538,"{'event_name': 'Outdoor Kung Fu: Movement, Fit...",{'group_topics': [{'urlkey': 'self-exploration...
67682,{'venue_name': 'Holistic Gateway Healing Arts ...,public,yes,0,"{'member_id': 9541156, 'photo': 'https://secur...",1872796475,1620581186000,"{'event_name': 'Pay what you can Acupuncture',...","{'group_topics': [{'urlkey': 'wilderness', 'to..."
67683,"{'venue_name': 'Café Del Río', 'lon': -3.72369...",public,yes,0,"{'member_id': 233327142, 'photo': 'https://sec...",1872796494,1620581196810,{'event_name': 'Intercambio de libros al aire ...,"{'group_topics': [{'urlkey': 'spanish', 'topic..."
67684,"{'venue_name': 'Online event', 'lon': 179.1962...",public,yes,0,"{'member_id': 248952652, 'photo': 'https://sec...",1872796495,1620581196853,{'event_name': '[ONLINE] EE Business Networki...,{'group_topics': [{'urlkey': 'entrepreneurship...
67685,"{'venue_name': 'Online event', 'lon': 179.1962...",public,yes,0,"{'member_id': 303426838, 'photo': 'https://sec...",1872796496,1620581198893,{'event_name': '¿Qué estructura liberadora pue...,"{'group_topics': [{'urlkey': 'scrum', 'topic_n..."
67686,{'venue_name': 'Avenida Marques de l'Argentera...,public,yes,0,"{'member_id': 289198365, 'photo': 'https://sec...",1872796497,1620581199181,{'event_name': 'Power Vinyasa Flow at My Centr...,"{'group_topics': [{'urlkey': 'yoga', 'topic_na..."
67687,"{'venue_name': 'BWI Airport', 'lon': -76.70160...",public,yes,0,"{'member_id': 327675966, 'member_name': 'Jasma...",1872796498,1620581199191,{'event_name': 'Sunscape Puerto Vallarta Resor...,"{'group_topics': [{'urlkey': 'travel', 'topic_..."
67688,,public,no,0,"{'member_id': 22468351, 'photo': 'https://secu...",1872796468,1620581199154,{'event_name': 'Sac Tall Club Business Meeting...,"{'group_topics': [{'urlkey': 'tall', 'topic_na..."
67689,"{'venue_name': 'Eixample Dret', 'lon': 2.17386...",public,no,0,"{'member_id': 331680216, 'photo': 'https://sec...",1872796376,1620581199338,"{'event_name': '🍴🍷 Comida de amistad', 'event_...","{'group_topics': [{'urlkey': 'outdoors', 'topi..."


The 'venue', 'member', 'event', and 'group' columns each hold nested attributes, for example:
1. "venue":{"venue_name":"Name of Venue", "lon":0.00000, "lat":00.0000, "venue_id": 12345678}
2. "event":{"event_name": "Name of Event", "event_id": "123456789", "time":1234560000, "event_url":"https:\/\/www.meetup.com\/City-Type-Something\/events\/123456789\/"}
3. "group":{"group_topics":[{"urlkey":"debates", "topic_name":"Topic"}], "group_city":"City", "group_country":"initials", "group_id":12345678, "group_name":"Name of Group", "group_lon": 0.000, "group_urlname":"jgdalkj"}

These need to be separated into two dataframes with column names pulled from the key:value pairs.

In [10]:
df.columns

Index(['venue', 'visibility', 'response', 'guests', 'member', 'rsvp_id',
       'mtime', 'event', 'group'],
      dtype='object')

#### Normalize the jsonl data
We need to normalize the jsonl data to expand the columns that contain multiple key:value pairs in order to create a data frame with the columns that we need. This will allow the creation of a dataframe with all the keys in separate columns. 
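
As a small illustration of what normalization does, the sketch below runs `pd.json_normalize` on two made-up records shaped like the stream's output. Nested keys become dotted column names such as `venue.venue_name`, and a record without a `venue` object yields missing values in the venue columns. The records are hypothetical and much smaller than the real ones.

```python
import pandas as pd

# Two hypothetical records; the second has no "venue" object at all.
records = [
    {
        "rsvp_id": 1,
        "response": "yes",
        "venue": {"venue_name": "Name of Venue", "lat": 0.0, "lon": 0.0},
        "event": {"event_id": "123456789", "event_name": "Name of Event"},
    },
    {
        "rsvp_id": 2,
        "response": "no",
        "event": {"event_id": "987654321", "event_name": "Another Event"},
    },
]

# Each nested key is expanded into its own "parent.child" column.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```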


In [11]:
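# To get a normalized version of the dataframe we can use:
import json

# Parse each line of the JSONL file into a dictionary
a = []
with open('C:\\Users\\pchri\\Chris Schmidt\\sample.jsonl', encoding="utf8") as f:
    for line in f:
        a.append(json.loads(line))

# Expand nested key:value pairs into separate columns
df = pd.json_normalize(a)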


#### Create dataframe df with the normalized sample.jsonl data

This creates a dataframe with 29 columns that can be used to create the EVENTS and RSVP tables that are required. View the first 10 rows of the dataframe. 

In [12]:
df = pd.json_normalize(a)

df.head(10)

Unnamed: 0,visibility,response,guests,rsvp_id,mtime,venue.venue_name,venue.lon,venue.lat,venue.venue_id,member.member_id,...,group.group_id,group.group_name,group.group_lon,group.group_urlname,group.group_lat,group.group_state,member.other_services.facebook.identifier,member.other_services.twitter.identifier,member.other_services.flickr.identifier,member.other_services.tumblr.identifier
0,public,yes,0,1872717810,1620498334477,Victoria Park,-0.038972,51.53656,26148882.0,261343705,...,32511274,London Philosophy Collective,-0.1,London-Philosophy-Collective,51.52,,,,,
1,public,no,0,1872032843,1620498339708,,,,,199239918,...,2689942,Adelaide Walkers & Joggers,138.6,Adelaide-Walkers-Joggers,-34.93,,,,,
2,public,yes,0,1872717815,1620498341704,Bluemont Park Picnic Pavilion,-77.130684,38.869236,26308501.0,223227249,...,27673469,Smiley Social,-77.1,Smiley-Social,38.89,VA,,,,
3,public,no,0,1870719986,1620498341802,General Aviation Center (GAC),8.57379,47.45337,27074423.0,312439269,...,35015193,Fly Zürich,8.54,fly-zurich,47.38,,,,,
4,public,yes,0,1872717816,1620498342950,Online event,179.1962,-8.521147,26906060.0,331379582,...,24897058,Spiritual Experiences Group of Steamboat Springs,-106.85,Spiritual-Experiences-Group-of-Steamboat-Springs,40.48,CO,,,,
5,public,yes,0,1872717817,1620498343772,Online event,179.1962,-8.521147,26906060.0,63075752,...,20117855,Thinking While Drinking SD,-117.17,thinking-while-drinking-sd,32.72,CA,,,,
6,public,yes,0,1872717818,1620498343992,La Pecera del Círculo de Bellas Artes,-3.696639,40.41842,27035437.0,277023324,...,10124532,MADRIDBABEL: GREAT LANGUAGE EXCHANGES AND MORE...,-3.71,MADRIDBABEL-GREAT-SOCIAL-ACTIVITIES-AND-LANGUA...,40.42,,,,,
7,public,yes,1,1872717819,1620498345786,635 Middle Country Rd,-73.0228,40.86829,26063945.0,329194397,...,3460952,Long Island Soccer,-73.59,Long-Island-Soccer,40.7,NY,,,,
8,public,yes,0,1872717820,1620498346666,Compton,-1.336569,51.023594,27067502.0,261515230,...,34910604,Walk Before You Climb,-1.33,walk-before-you-climb-group,51.07,,,,,
9,public,yes,0,1872717823,1620498347996,Georgetown Art Center,-97.676987,30.635921,23599847.0,331167042,...,31598909,Austin Art Lovers,-97.71,Austin-Art-Lovers,30.27,TX,,,,


#### Create the EVENT and RSVP tables

Build the EVENT and RSVP dataframes with the specified fields and view the first 10 rows of each. 

In [13]:
# Create the EVENT and RSVP dataframes. 
event = df[['event.event_id', 'event.event_name', 'event.time', 'venue.venue_name', 'venue.lat','venue.lon', 'group.group_country','group.group_name','group.group_urlname']]
rsvp = df[['rsvp_id', 'mtime', 'response', 'event.event_id', 'guests']]

In [14]:
event.head()

Unnamed: 0,event.event_id,event.event_name,event.time,venue.venue_name,venue.lat,venue.lon,group.group_country,group.group_name,group.group_urlname
0,278041698,Philosophy in the Park 013,1621685000000.0,Victoria Park,51.53656,-0.038972,gb,London Philosophy Collective,London-Philosophy-Collective
1,277931938,Wine Shanty Altitude Hike. 8:30 am to 11. Abo...,1620515000000.0,,,,au,Adelaide Walkers & Joggers,Adelaide-Walkers-Joggers
2,278065424,Volleyball 101 - Absolute Beginners Lessons - ...,1621089000000.0,Bluemont Park Picnic Pavilion,38.869236,-77.130684,us,Smiley Social,Smiley-Social
3,277609495,Adventure Flight,1626016000000.0,General Aviation Center (GAC),47.45337,8.57379,ch,Fly Zürich,fly-zurich
4,277648513,"Experience HU, The Sound of Soul",1620578000000.0,Online event,-8.521147,179.1962,us,Spiritual Experiences Group of Steamboat Springs,Spiritual-Experiences-Group-of-Steamboat-Springs
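
The goals list plain field names such as `event_id` and `venue_lat`, while the normalized dataframe uses dotted names such as `event.event_id`. One way to reconcile the two, sketched here as an optional step, is to select and rename in a single pass. The helper `select_and_rename` is hypothetical; the demo row reuses the first row of the output above.

```python
import pandas as pd

# Mapping from normalized column names to the field names listed in the goals.
EVENT_COLUMNS = {
    "event.event_id": "event_id",
    "event.event_name": "event_name",
    "event.time": "event_time",
    "venue.venue_name": "venue_name",
    "venue.lat": "venue_lat",
    "venue.lon": "venue_lon",
    "group.group_country": "group_country",
    "group.group_name": "group_name",
    "group.group_urlname": "group_urlname",
}


def select_and_rename(frame: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Keep only the mapped columns, in mapping order, under their new names."""
    return frame[list(mapping)].rename(columns=mapping)


demo = pd.DataFrame(
    [
        {
            "visibility": "public",  # extra column that should be dropped
            "event.event_id": "278041698",
            "event.event_name": "Philosophy in the Park 013",
            "event.time": 1621685000000.0,
            "venue.venue_name": "Victoria Park",
            "venue.lat": 51.53656,
            "venue.lon": -0.038972,
            "group.group_country": "gb",
            "group.group_name": "London Philosophy Collective",
            "group.group_urlname": "London-Philosophy-Collective",
        }
    ]
)

events_demo = select_and_rename(demo, EVENT_COLUMNS)
```

An analogous mapping for the RSVP table would rename `guests` to `guest_count` and `event.event_id` to `event_id`.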


In [15]:
rsvp.head()

Unnamed: 0,rsvp_id,mtime,response,event.event_id,guests
0,1872717810,1620498334477,yes,278041698,0
1,1872032843,1620498339708,no,277931938,0
2,1872717815,1620498341704,yes,278065424,0
3,1870719986,1620498341802,no,277609495,0
4,1872717816,1620498342950,yes,277648513,0
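
The `mtime` values above look like Unix timestamps in milliseconds, which is an assumption based on their magnitude. Under that assumption, they can be converted to readable datetimes with `pd.to_datetime`, as in this short sketch that uses the first two rows shown above.

```python
import pandas as pd

rsvp_demo = pd.DataFrame(
    {
        "rsvp_id": [1872717810, 1872032843],
        "mtime": [1620498334477, 1620498339708],
    }
)

# Interpret mtime as milliseconds since the Unix epoch, in UTC.
rsvp_demo["mtime_utc"] = pd.to_datetime(rsvp_demo["mtime"], unit="ms", utc=True)
first = rsvp_demo.loc[0, "mtime_utc"]
print(first)
```

The same call would apply to `event.time` in the EVENTS table if it follows the same convention.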


#### Create ProfileReport using pandas-profiling for EVENTS and RSVP tables

Use the pandas profiling tool to create an EDA report for both of the dataframes created for this assessment. 

In [20]:
# Create the EDA report object for the EVENTS table

profileEvents = ProfileReport(event, title="Pandas Profiling Report", explorative=True)

In [21]:
# Create the EDA report object for the RSVP table

profileRsvp = ProfileReport(rsvp, title="Pandas Profiling Report", explorative=True)

#### ProfileReport for RSVP table

The pandas-profiling module creates a detailed interactive report that opens below the Jupyter Notebook code block that generates it. By clicking on the tabs and sub-tabs in the report, the viewer can review a detailed analysis of the data table.

In [28]:
# This code generates a ProfileReport for the RSVP table in the Jupyter Notebook

profileRsvp.to_widgets()


#### Shareable ProfileReport for RSVP in HTML

The following code renders the shareable HTML ProfileReport for the RSVP table inside the notebook.

In [29]:
# html report generated for the profileRSVP table.

profileRsvp.to_notebook_iframe()





#### Summary of ProfileReport Findings for RSVP table

The report identifies several areas to highlight that may be an issue.
- There are 565 (0.8%) duplicate rows matched in rsvp_id. 
- The guests column has 64,500 zeros (95.3%).
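
Both findings can be cross-checked directly in pandas, independently of the profiling report. The sketch below uses a tiny made-up dataframe in place of the real `rsvp` table. If repeated `rsvp_id` values turn out to be updates to an earlier response (an assumption that would need checking against `mtime`), keeping the last occurrence is one possible clean-up.

```python
import pandas as pd

# Toy stand-in for the RSVP table; the values are invented.
rsvp_demo = pd.DataFrame(
    {
        "rsvp_id": [1, 2, 2, 3],
        "guests": [0, 0, 0, 1],
    }
)

# Rows whose rsvp_id has already appeared earlier in the table.
n_dup = int(rsvp_demo.duplicated(subset="rsvp_id", keep="first").sum())

# Share of rows reporting zero guests.
zero_share = float((rsvp_demo["guests"] == 0).mean())

# One possible clean-up: keep only the last row seen for each rsvp_id.
latest = rsvp_demo.drop_duplicates(subset="rsvp_id", keep="last")

print(n_dup, zero_share, len(latest))
```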

#### ProfileReport Outline Details for RSVP Table

__I. Overview Tab__

A. Overview

- Duplicate Rows: 565 duplicate rows matched on a specific rsvp_id 

B. Warnings

- High cardinality in several of the columns
- Skewed data in the columns but this is irrelevant given the type of data

C. Reproduction

- Simply a timestamp overview of this report.

__II. Variables Tab__

- This tab generates drop-down tabs for each of the columns in the dataframe. The structure repeats for each column, so I will only highlight the structure of the first column. There is nothing of obvious interest in this section for the RSVP table.

__III. Interactions Tab__

- This creates a series of interactive plots that can be used to assess interactions between dependent and independent variables for columns with numeric values. This is not useful information for this data table.

__IV. Correlations Tab__

- This tab creates heat map correlation tables using Pearson's r, Spearman's ρ, Kendall's τ, or PhiK ($\phi$k) tests. The PhiK test adds response (Yes or No) to the correlation table. There does not appear to be much useful information here.

__V. Missing Values Tab__
    
- This identifies the occurrence of missing values using Count, Matrix, Heatmap, and Dendrogram methods. There are no missing values in the RSVP table.

__VI. Sample Tab__

- This generates a sample view of 10 rows of the data frame. There is nothing of obvious interest here. 

__VII. Duplicate Rows Tab__

- This shows a sample of the top 10 duplicate rows and the count of those occurrences. There are duplications based on the rsvp_id field. This may be a flag to investigate further to ascertain the reason for this duplication.



#### ProfileReport for EVENTS table

The pandas-profiling module creates the detailed interactive report shown below. Clicking on the tabs inside the report shows detailed information, multiple layers deep.

In [22]:
# This code generates a ProfileReport for the EVENTS table in the Jupyter Notebook

profileEvents.to_widgets()


#### Shareable ProfileReport for the EVENTS table in HTML

The following code renders the shareable HTML ProfileReport for the EVENTS table inside the notebook.

In [49]:
# html report generated for the profileEvents table.

profileEvents.to_notebook_iframe()

#### Summary of ProfileReport Findings for EVENTS table

The report identifies several areas to highlight that may be an issue.
- Missing 2.8% of the cells overall. 
- Missing 8.4% of the venue name and location columns, lat. and lon.
- Duplicates of character strings identified in the event ID column.
- The columns event_name, group_name, and group_urlname are populated with combinations of mixed case, all capital letters, all lower case letters, and occasional emojis in the same cells. The emojis in the data may be a reason for a flag on this issue.
- There are 11,906 duplicate rows that need further investigation to ascertain if there is a rational reason for the duplication.
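
The missing-value and duplicate-row figures can be examined directly in pandas. One plausible, unverified explanation for the duplicates is that the EVENTS dataframe is sliced from per-RSVP rows, so an event that receives several RSVPs appears once per RSVP. The sketch below uses a small invented dataframe in place of the real `event` table to show how to measure missingness, count duplicate rows, and reduce the table to one row per event.

```python
import pandas as pd

# Toy stand-in for the EVENTS table; event "a1" received two RSVPs.
events_demo = pd.DataFrame(
    {
        "event.event_id": ["a1", "a1", "b2", "c3"],
        "venue.venue_name": ["Park", "Park", None, "Cafe"],
        "venue.lat": [51.5, 51.5, None, 40.4],
    }
)

# Share of missing values per column and over all cells.
missing_share = events_demo.isna().mean()
overall_missing = float(events_demo.isna().to_numpy().mean())

# Fully duplicated rows (every column equal to an earlier row).
n_dup_rows = int(events_demo.duplicated().sum())

# How many rows (RSVPs) refer to each event.
rsvps_per_event = events_demo["event.event_id"].value_counts()

# One row per event, keeping the first occurrence.
unique_events = events_demo.drop_duplicates(subset="event.event_id")

print(missing_share, overall_missing, n_dup_rows, len(unique_events))
```

If the per-RSVP explanation holds for the real data, the number of rows remaining after `drop_duplicates` should be close to the unique `event_id` count reported by the profile.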
    

#### Detailed Outline of ProfileReport for EVENTS table

__I. Overview Tab__

A. Overview

- Missing Cells: 17041 (2.8%)
- Duplicate Rows: 11906 (what does a "duplicate row" look like?)

B. Warnings

- High cardinality in several of the columns
- Missing values in 8.4% of the venue name and locations (lat., lon.) columns

C. Reproduction

- Simply a timestamp overview of this report.

__II. Variables Tab__

- This tab generates drop-down tabs for each of the columns in the dataframe. The structure repeats for each column, so I will only highlight the structure of the first column. There is a lot of information here.

A. Event.Event_Id
- Overview
- Unique: 14643 (21.6% of the values in this column)
- Categories: appears to show duplicate character strings
- Words: appears to show duplicate character strings
- Character: 4 tabs below this tab. This is a count overview.
    - Characters: most frequent character count
    - Categories: count of lower case letters, decimal numbers, etc. Appears incomplete.
    - Scripts: most common scripts, ranked. Gives information on the count of different languages.
    - Blocks: count of characters per block, i.e. ASCII block counts by frequency of character.

__III. Interactions Tab__

- This creates a series of interactive plots that can be used to assess interactions between dependent and independent variables for columns with numeric values. In this table only event.time, venue.lat, and venue.lon apply, so this is not useful information.

__IV. Correlations Tab__

- This tab creates heat map correlation tables using Pearson's r, Spearman's ρ, Kendall's τ, or PhiK ($\phi$k) tests. The PhiK test adds group_country to the correlation table along with event.time, venue.lat, and venue.lon, so there may be some usable information in the relationships among those four variables.

__V. Missing Values Tab__
    
- This identifies the occurrence of missing values using Count, Matrix, Heatmap, and Dendrogram methods. We see that venue_name, venue_lat, and venue_lon are the columns with the missing values.

__VI. Sample Tab__

- This generates a sample view of 10 rows of the data frame. We can see that event_id contains both numeric and letter values, and that the event_name, group_name, and group_urlname columns mix upper and lower case and also contain emojis.

__VII. Duplicate Rows Tab__

- This shows a sample of the top 10 duplicate rows and the count of those occurrences. This is a flag to investigate further to ascertain the reason for this duplication.



#### Code to Generate and Save Shareable HTML Reports for EVENTS and RSVP Tables

In [30]:
# This is the code to generate and save an html report that can be shared externally.

profileEvents.to_file(r"C:\Users\pchri\Chris Schmidt\EVENTSreport.html")





In [32]:
# This is the code to generate and save an html report that can be shared externally.

profileRsvp.to_file(r"C:\Users\pchri\Chris Schmidt\RSVPreport.html")



