# Describe: 

-Our data comes from the Edinburgh Festival API provided by our data host. It contains information on all Edinburgh festivals from 2012 to 2020; we mainly analyze the Fringe Festival data from 2015 to 2019. Our data host hopes that we can identify or predict which festivals or events will become popular in the future.

The process of retrieving the data from the API is not shown; please contact me if you need it.

○ what is the general type of the data (tabular, network, geographical, textual etc.)

-The data is in JSON format and is mainly hierarchical. It also includes latitude and longitude, which I consider a form of geographic data.

○ how large and complex is it (rows/columns, size, variation, structure) 

-There are 4,257 records from 2019, 3,985 from 2018, and 3,795 from 2017, each of which represents an event. Each CSV file is larger than 23 MB.

○ What fields and data types are present (max/min, levels for categorical values).

-Each event has 34 keys. Most hold string, float, bool, or int values that can be analyzed directly; some keys are null, and some hold nested dicts and lists. The number of performers has a measurable minimum and maximum, while genre, artist type, and age limit can be treated as categorical variables.
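
As a rough sketch of how these checks could be done, the cell below computes the minimum and maximum of a numeric column and the levels of a categorical one. The tiny DataFrame and the column names genre and age_category are my own assumptions for illustration; only performers_number is a column name taken from the data.

```python
import pandas as pd

# Illustrative rows only; genre and age_category are assumed column names.
events = pd.DataFrame({
    "performers_number": [1, 2, 5, 200],
    "genre": ["Comedy", "Theatre", "Comedy", "Music"],
    "age_category": ["18+", None, "12+", "18+"],
})

# Range of a numeric field.
numeric_range = events["performers_number"].agg(["min", "max"])

# Levels of categorical fields; missing values are labelled 'Unknown' first.
genre_levels = events["genre"].value_counts()
age_levels = events["age_category"].fillna("Unknown").value_counts()

print(numeric_range)
print(genre_levels)
print(age_levels)
```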

○ Links between this data and other data (e.g. foreign keys, unique ids) 

-There are no links between festivals, but in our subsequent analysis we will use festival_code as the foreign key to connect festivals to their nested lists and dicts.
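
To illustrate the idea of a foreign key, here is a minimal sketch of joining a table built from a nested field back to its events. The two small DataFrames and their values are made up for the example; the key names code and festival_code follow the ones used in this notebook.

```python
import pandas as pd

# Made-up events table; 'code' identifies each event.
events = pd.DataFrame({
    "code": ["A1", "B2"],
    "title": ["Show A", "Show B"],
})

# Made-up table extracted from a nested field, keyed by 'festival_code'.
disabled = pd.DataFrame({
    "festival_code": ["A1", "B2"],
    "audio": [True, False],
})

# A left join keeps every event and attaches the matching nested data.
joined = events.merge(disabled, left_on="code",
                      right_on="festival_code", how="left")
print(joined)
```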

In [1]:
# import sys # never mind these two commands.
import pandas as pd
import numpy as np
import csv
import json
import seaborn as sns     
import matplotlib.pyplot as plt
%matplotlib inline

# Load each year's events; the with-blocks close the files after reading.
with open('fringe_2019.json') as f:
    data_list2019 = json.load(f)
fringe_2019 = pd.DataFrame(data_list2019)
with open('fringe_2018.json') as f:
    data_list2018 = json.load(f)
fringe_2018 = pd.DataFrame(data_list2018)
with open('fringe_2017.json') as f:
    data_list2017 = json.load(f)
fringe_2017 = pd.DataFrame(data_list2017)
fringe_2019.describe()

Unnamed: 0,sub_venue,year,fringe_first,performers_number,sub_title,description_teaser,longitude,non_english,latitude
count,0.0,4257.0,0.0,4257.0,0.0,0.0,4257.0,0.0,4257.0
mean,,2019.0,,4.807846,,,-3.18924,,55.948427
std,,0.0,,8.845754,,,0.012095,,0.007872
min,,2019.0,,1.0,,,-3.381604,,55.608425
25%,,2019.0,,1.0,,,-3.192293,,55.945854
50%,,2019.0,,2.0,,,-3.187818,,55.948156
75%,,2019.0,,5.0,,,-3.185542,,55.950167
max,,2019.0,,200.0,,,-2.990308,,55.980849


Let's take fringe_2019 as an example to discuss the data.

First of all, 'sub_venue', 'fringe_first', 'sub_title', 'description_teaser', and 'non_english' are all NaN (their count is 0), so they will not be analyzed.

Then, the count of 4257 shows that year, performers_number, longitude, and latitude contain no NaN values.
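
The same conclusions can be checked directly instead of being read off the describe() table. This is only a sketch on a made-up three-row DataFrame; the column names are taken from the output above, the values are not.

```python
import numpy as np
import pandas as pd

# Made-up sample with one column that is entirely NaN.
sample = pd.DataFrame({
    "year": [2019, 2019, 2019],
    "performers_number": [1, 2, 5],
    "sub_venue": [np.nan, np.nan, np.nan],
})

# Number of missing values per column.
missing = sample.isna().sum()

# Columns where every row is missing carry no information, so drop them.
all_missing = missing[missing == len(sample)].index.tolist()
cleaned = sample.drop(columns=all_missing)

print(missing)
print(all_missing)
```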

And finally, the API documentation says: 'Where no value is available, the API will return null values - you should ensure your application treats and displays these values as "Unknown" rather than for example as equivalent to a boolean false or numerical zero'. So if null values appear later on, I will fill them with 'Unknown'.
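
A small sketch of why this matters: a JSON null becomes None in Python, and it should stay distinct from False. The record below is invented for illustration; the keys audio, signed, and captioning are the ones used later in this notebook.

```python
import json

# Invented record: 'signed' has no value, 'captioning' is explicitly false.
raw = '{"audio": true, "signed": null, "captioning": false}'
record = json.loads(raw)

# Replace only missing values (None); False must remain False.
labelled = {key: "Unknown" if value is None else value
            for key, value in record.items()}

print(labelled)
```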

* In the columns disabled, update_times, discounts, performance_space, categories, and venue, the value in each row is converted to a dictionary.
* In the column performances, the value in each row is converted to a list in which every item is a dictionary.
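
Once performances holds real lists of dictionaries, one possible next step is to expand it to one row per performance. The sketch below assumes the column is stored as a string of Python literals; the keys start and price and all values are hypothetical, not taken from the API.

```python
import ast
import pandas as pd

# Hypothetical events whose 'performances' column is stored as text.
events = pd.DataFrame({
    "code": ["A1", "B2"],
    "performances": [
        "[{'start': '2019-08-02 19:00:00', 'price': 10.0},"
        " {'start': '2019-08-03 19:00:00', 'price': 12.0}]",
        "[{'start': '2019-08-05 14:30:00', 'price': 8.0}]",
    ],
})

# Parse the text into lists of dictionaries.
events["performances"] = events["performances"].apply(ast.literal_eval)

# One row per performance, then spread each dictionary into columns.
exploded = events.explode("performances").reset_index(drop=True)
details = pd.json_normalize(exploded["performances"].tolist())
performances = pd.concat([exploded[["code"]], details], axis=1)

print(performances)
```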

In [2]:

import ast

# Columns whose values are dictionaries stored as strings.
DICT_COLUMNS = ['disabled', 'update_times', 'discounts',
                'performance_space', 'categories', 'venue']

def preprocess(df):
    """Parse the stringified nested columns of df in place."""
    # ast.literal_eval only accepts Python literals, so unlike eval it
    # cannot run arbitrary code that might be embedded in the data.
    for column in DICT_COLUMNS:
        df[column] = df[column].apply(ast.literal_eval)
    # Each performances entry becomes a list of dictionaries.
    df['performances'] = df['performances'].apply(
        lambda value: list(ast.literal_eval(value)))

preprocess(fringe_2019)
preprocess(fringe_2018)
preprocess(fringe_2017)

Write the 2019 disabled data (in dict format) to a CSV file for convenient processing later.

I use 'festival_code' as a foreign key so that the follow-up analysis can map each row back to its specific festival.

In [3]:

disc_list = fringe_2019['disabled'].tolist()
disc = disc_list
with open('disa_fringe_2019.csv', 'w', newline='', encoding='utf-8') as f:
    # use 'festival_code' as foreign key
    disc_fieldnames = list(disc[0].keys()) + ['festival_code']

    writer = csv.DictWriter(f, fieldnames=disc_fieldnames)
    writer.writeheader()
    for row, code in zip(disc, fringe_2019['code']):
        # build a new dict so the ones stored in fringe_2019 stay unchanged
        writer.writerow({**row, 'festival_code': code})

In [4]:
# Some simple data cleaning, taking 2019 as an example.
# Drop every event whose three disability services (audio, signed, captioning)
# are all 'Unknown', then save the result as a new CSV file for later use.
data = pd.read_csv("disa_fringe_2019.csv")
data = data.fillna('Unknown')
all_unknown = (data['audio'].isin(['Unknown'])
               & data['signed'].isin(['Unknown'])
               & data['captioning'].isin(['Unknown']))
data = data[~all_unknown]
data.to_csv("disa_2019.csv")
