## Cleaning

All the cleaning happens here. The cleaned files serve two purposes:
* Creating `vars_dic`: for this we mostly need the column names and the metadata.
    * `vars_dic` will in turn create all the models, views, and forms related to them.
* Creating the SQL scripts that populate the database: for this we need clean files in which we know exactly what the null values are and how each particular file handles them.

In [1]:
import csv
import pandas as pd
from csv_cleaner_helpers import *

# NO SOURCE for this data as of Sep 7th.

df_agricultural_data = pd.read_csv("Qing_Datasets/CrisisDB_agricultural_data.csv")
df_agricultural_data.head(2)

Unnamed: 0,Year,Total Population (1000),Agricultural population (1000),Arable land (1000 mu),Arable land per farmer (mu),Gross grain shared per agricultural population (catties per capita),Net grain shared per agricultural population (catties per capita),Surplus
0,1600,120000,97200,725464,27.52,1765,819,469
1,1766,200000,170000,1036109,25.22,1700,789,439


In [2]:
df_agricultural_data.columns = df_agricultural_data.columns.str.replace(" ", "_")

df_agricultural_data.columns = df_agricultural_data.columns.str.lower()
# The following renames could be done more easily by editing the csv file itself here in the Jupyter environment.
# regex=False treats the parentheses literally and behaves the same across pandas versions.
df_agricultural_data.columns = df_agricultural_data.columns.str.replace("_(catties_per_capita)", "", regex=False)
df_agricultural_data.columns = df_agricultural_data.columns.str.replace("_(1000)", "", regex=False)
df_agricultural_data.columns = df_agricultural_data.columns.str.replace("_(mu)", "", regex=False)
df_agricultural_data.columns = df_agricultural_data.columns.str.replace("_(1000_mu)", "", regex=False)

df_agricultural_data.rename(columns = {'year':'year_from'}, inplace = True)

df_agricultural_data = add_polity_col(df_agricultural_data)
df_agricultural_data = add_year_to(df_agricultural_data)
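The two helpers above come from `csv_cleaner_helpers`, which is not shown in this notebook. As a rough sketch of what they appear to do, judging only from the outputs further down: the names `add_polity_col_sketch` and `add_year_to_sketch`, and the 1796 cutoff between `CnQingE` and `CnQingL`, are my assumptions and not the real implementation.

```python
import pandas as pd

# Assumption: rows from 1796 onward belong to the Late Qing polity (CnQingL).
# The cutoff is inferred from the outputs in this notebook, not from the real helper.
LATE_QING_START = 1796


def add_polity_col_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a `polity_id` column inserted as the first column."""
    out = df.copy()
    polity = out["year_from"].apply(
        lambda year: "CnQingL" if year >= LATE_QING_START else "CnQingE"
    )
    out.insert(0, "polity_id", polity)
    return out


def add_year_to_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with `year_to` equal to `year_from`, placed right after it."""
    out = df.copy()
    position = out.columns.get_loc("year_from") + 1
    out.insert(position, "year_to", out["year_from"])
    return out


# Toy rows, not real data.
demo = pd.DataFrame({"year_from": [1600, 1766, 1850], "value": [1, 2, 3]})
result = add_year_to_sketch(add_polity_col_sketch(demo))
print(result)
```

Both functions return a copy, which matches the `df = helper(df)` pattern used throughout this notebook.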

### Save the `CLEAN` file
#### TO: `Qing_Datasets_CLEAN/`

In [3]:
# create separate dfs based on models in Django (tentative)
file_name = "CrisisDB_agricultural_data"
agricultural_data_vars_dic = {}
for column in list(df_agricultural_data.columns):
    if column == "year_from" or column == "year_to" or column == "polity_id":
        continue
    df_key = "df_" + column
    df_value = df_agricultural_data[["polity_id", "year_from", "year_to", column]].copy()
    agricultural_data_vars_dic[df_key] = df_value
    # CLEAN_CSV file saving:
    my_address = "Qing_Datasets_CLEAN/" + str(df_key) + "_FROM_" + file_name + "_CLEAN.csv"
    df_value.to_csv(my_address, index = False)
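The same split-per-variable loop shows up again for the silver and REACHES data, so it could probably live in one small function. A minimal sketch, assuming every frame has already been given the three key columns; `split_by_variable` and `KEY_COLUMNS` are hypothetical names, not part of `csv_cleaner_helpers`.

```python
import pandas as pd

# Columns shared by every per-variable frame.
KEY_COLUMNS = ["polity_id", "year_from", "year_to"]


def split_by_variable(df: pd.DataFrame) -> dict:
    """Map 'df_<column>' to a frame holding the key columns plus that one column."""
    frames = {}
    for column in df.columns:
        if column in KEY_COLUMNS:
            continue
        frames["df_" + column] = df[KEY_COLUMNS + [column]].copy()
    return frames


# Two rows taken from the agricultural table above, reduced to two variables.
demo = pd.DataFrame({
    "polity_id": ["CnQingE", "CnQingE"],
    "year_from": [1600, 1766],
    "year_to": [1600, 1766],
    "arable_land": [725464, 1036109],
    "surplus": [469, 439],
})
frames = split_by_variable(demo)
print(sorted(frames))
print(list(frames["df_surplus"].columns))
```

Saving the CLEAN files would then be a separate loop over `frames.items()`, which keeps the splitting step easy to check on its own.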

### Add 
- `total_population`, (probably already there)
- `agricultural_population`,
- `arable_land`,
- `arable_land_per_farmer`,
- `gross_grain_shared_per_agricultural_population`,
- `net_grain_shared_per_agricultural_population`,
- `surplus`,

### To
- `resulting_vars_dic`

In [4]:
# We don't have any explanations at the moment
agri_explanations = {
    # 'total_population': "No Explanations." (will be there through other files)
    'agricultural_population': "No Explanations.",
    'arable_land': "No Explanations.",
    'arable_land_per_farmer': "No Explanations.",
    'gross_grain_shared_per_agricultural_population': "No Explanations.",
    'net_grain_shared_per_agricultural_population': "No Explanations.",
    'surplus': "No Explanations.",
}

my_ints_dic_AGRI = {}
for col, col_exp in agri_explanations.items():
    my_ints_dic_AGRI[col] = {}
    if col in ['agricultural_population']:
        my_ints_dic_AGRI[col][col] = {
            'var_exp': col_exp,
            'units': "People",
            'min': 0,
            'scale': 1000}
    elif col in ['arable_land',]:
        my_ints_dic_AGRI[col][col] = {
            'var_exp': col_exp,
            'units': "mu?",
            'scale': 1000,}
    elif col in ['arable_land_per_farmer',]:
        my_ints_dic_AGRI[col][col] = {
            'var_exp': col_exp,
            'units': "mu?",}
    elif col in ['gross_grain_shared_per_agricultural_population',
                 'net_grain_shared_per_agricultural_population', 'surplus']:
        my_ints_dic_AGRI[col][col] = {
            'var_exp': col_exp,
            'units': "(catties per capita)",}
    else:
        print("ERROR: ", col, " is not in the list of options.")

for df_key, df_value in agricultural_data_vars_dic.items():
    var_name = df_key[3:]
    df_var = df_value
    if var_name == "total_population":
        continue
    my_ints_dic = my_ints_dic_AGRI[var_name]
    vars_dic_entry_maker(var_name, df_var,
                        my_ints_dic, {}, {}, {},
                         "Economy Variables", "Productivity",
                         notes = "Notes on the Variable",
                         main_desc = agri_explanations.get(var_name, "No Main Descriptions Provided."),
                        )
my_ints_dic_AGRI.keys()
#df_value

Hit me:  agricultural_population
Hit me:  arable_land
Hit me:  arable_land_per_farmer
Hit me:  gross_grain_shared_per_agricultural_population
Hit me:  net_grain_shared_per_agricultural_population
Hit me:  surplus


dict_keys(['agricultural_population', 'arable_land', 'arable_land_per_farmer', 'gross_grain_shared_per_agricultural_population', 'net_grain_shared_per_agricultural_population', 'surplus'])

<hr style="margin-top:0px;height:18px;background-image: linear-gradient(to right, red,orange,green,blue,indigo,violet);">

- Is there any way we can give exactly this as the input to `vars_dic`?
- At the end of the day, we want to create `vars_dic` from these individual files.
- So we can potentially (and actually more practically) provide the data (columns, metadata, units, and everything else) and let Python create `vars_dic` for us.
- We only need to make sure we feed all the information to a mediator function between our files and our Django code.

#### What does `vars_dic` need?

I believe we can take this from both the `ABC.csv` file and the `ABC_metadata.csv` file:

- A list of column names
- The name of the main column (or the name of the variable)
- The number of columns in the variable
- The section name
- The subsection name
- A description for each column
- The main description for the model (variable)
- The datatype for each column: MYINT, MYDEC, etc.
- The units
- What null values mean
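One way to pin this list down is a small record type that a mediator function could fill from the two files. This is only a sketch: `ColumnSpec`, `VariableSpec`, and `to_entry` are hypothetical names, and the dictionary layout is not the actual `vars_dic` layout, which this notebook does not show. The example values are the ones used for `total_population` further down; the "MYINT" datatype is my guess based on it being passed as an ints dictionary.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ColumnSpec:
    name: str
    datatype: str      # e.g. "MYINT" or "MYDEC"
    description: str
    units: str = ""


@dataclass
class VariableSpec:
    name: str
    section: str
    subsection: str
    main_description: str
    null_meaning: str
    columns: list = field(default_factory=list)


def to_entry(spec: VariableSpec) -> dict:
    """Flatten a spec into a plain dict (hypothetical layout, for illustration only)."""
    entry = asdict(spec)
    entry["column_names"] = [column.name for column in spec.columns]
    entry["column_count"] = len(spec.columns)
    return entry


spec = VariableSpec(
    name="total_population",
    section="Social Complexity Variables",
    subsection="Social Scale",
    main_description="Total population of a given area at a given time.",
    null_meaning="The value is not available.",
    columns=[
        ColumnSpec(
            name="total_population",
            datatype="MYINT",  # assumption
            description="The total population of a country (or a polity).",
            units="People",
        )
    ],
)
entry = to_entry(spec)
print(entry["column_names"], entry["column_count"])
```

The column names and the column count are derived rather than typed in, so they cannot drift away from the column list.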

<hr style="margin-top:0px;height:18px;background-image: linear-gradient(to right, red,orange,green,blue,indigo,violet);">

In [5]:
import csv
import pandas as pd
from csv_cleaner_helpers import *

# Source??

df_milexpenses = pd.read_csv("Qing_Datasets/qing_milexpense.csv", delimiter="|")
df_milexpenses.head(15)

Unnamed: 0,Conflict,Period,Expenditure,Note
0,Three Feudatories,1674-1681,100.0,in millions silver taels
1,Dzunghar,1715-1726,50.0,
2,Taiwan rebellion,1721,9.0,
3,Taiwan rebellion,1787-1788,10.0,
4,White Lotus rebellion,1796-1804,53.0,
5,Gurkha campaigns,1788-1792,10.0,
6,Ocean Bandit campaigns,1802-1810,7.0,
7,Taiping Rebellion,1850-1864,290.0,
8,southwestern rebellions,1851-1873,518.0,
9,Xinjiang expedition,1875-1878,75.0,


We will be dealing with two types of notes:
* The first type, which we have here, is a `note_meta` for the whole variable; it will go into the variablehierarchy notes.
* The second type is a `note_specific`, which relates to a specific row of data. We don't have any of those here.

In [6]:
df_milexpenses.columns = df_milexpenses.columns.str.lower()

In [7]:
df_milexpenses[["year_from", "year_to"]] = df_milexpenses.apply(period_to_year, separator= "-", 
                                                                col_name = "period" , axis=1)
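`period_to_year` also lives in `csv_cleaner_helpers`. Judging from the output below, it splits a period such as `1674-1681` into two years and repeats a single year such as `1721` for both. A minimal sketch under that assumption; `period_to_year_sketch` is a hypothetical stand-in and not the real helper.

```python
import pandas as pd


def period_to_year_sketch(row, separator="-", col_name="period"):
    """Split 'start-end' into two integers; a lone year is used for both ends."""
    parts = str(row[col_name]).split(separator)
    year_from = int(parts[0])
    year_to = int(parts[-1])
    return pd.Series({"year_from": year_from, "year_to": year_to})


# Two period strings taken from the table above.
demo = pd.DataFrame({"period": ["1674-1681", "1721"]})
demo[["year_from", "year_to"]] = demo.apply(
    period_to_year_sketch, separator="-", col_name="period", axis=1
)
print(demo)
```

This simple split would break on negative (BCE) years when the separator is also `-`; that does not matter for the Qing data but would need handling elsewhere.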

In [8]:
del df_milexpenses["period"]
del df_milexpenses["note"]

In [9]:
df_milexpenses = add_polity_col(df_milexpenses)
df_milexpenses = move_col_to(df_milexpenses, "year_from", 1)
df_milexpenses = move_col_to(df_milexpenses, "year_to", 2)

df_milexpenses.expenditure = df_milexpenses.expenditure.astype("float")

In [10]:
df_milexpenses.head(2)

Unnamed: 0,polity_id,year_from,year_to,conflict,expenditure
0,CnQingE,1674,1681,Three Feudatories,100.0
1,CnQingE,1715,1726,Dzunghar,50.0


In [11]:
df_milexpenses

Unnamed: 0,polity_id,year_from,year_to,conflict,expenditure
0,CnQingE,1674,1681,Three Feudatories,100.0
1,CnQingE,1715,1726,Dzunghar,50.0
2,CnQingE,1721,1721,Taiwan rebellion,9.0
3,CnQingE,1787,1788,Taiwan rebellion,10.0
4,CnQingL,1796,1804,White Lotus rebellion,53.0
5,CnQingE,1788,1792,Gurkha campaigns,10.0
6,CnQingL,1802,1810,Ocean Bandit campaigns,7.0
7,CnQingL,1850,1864,Taiping Rebellion,290.0
8,CnQingL,1851,1873,southwestern rebellions,518.0
9,CnQingL,1875,1878,Xinjiang expedition,75.0


### Save the `CLEAN` file

In [13]:
my_address = "Qing_Datasets_CLEAN/military_expenditure_CLEAN.csv"
df_milexpenses.to_csv(my_address, index = False)

### Add `military_expense` to `resulting_vars_dic`

In [14]:
# for each variable we need a set of four dictionaries
my_floats_dic = {
    'expenditure': {
        'var_exp': "The military expenses in millions silver taels.",
        'units': "millions silver taels",
                },
}

my_chars_dic = {
    'conflict': {
        'var_exp': "The name of the conflict",
                },
}

vars_dic_entry_maker("military_expense", df_milexpenses,
                     {}, my_floats_dic, my_chars_dic, {},
                    "Economy Variables", "State Finances",
                     notes = "Not sure about Section and Subsection.",
                    # TODO: this URL looks copied from the disease_outbreak entry; replace it with the real source.
                    main_desc_source = "https://en.wikipedia.org/wiki/Disease_outbreak",
                     null_meaning = "The value is not available.")

<hr style="margin-top:0px;height:18px;background-image: linear-gradient(to right, red,orange,green,blue,indigo,violet);">

In [15]:
from csv_cleaner_helpers import *
import pandas as pd

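# The source has no Zotero link yet.
# English gloss: Li Longsheng, "International Trade in the Qing Dynasty: Silver Inflow, Monetary Crisis,
# and Late Qing Industrialization" (Taiwan: Showwe, 2010), pp. 148-155.
source = "李隆生《清代的国际贸易：白银流入、货币危机和晚清工业化》，台湾秀威科技股份有效公司2010年版，第148-155页。"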

df_silver_inflow = pd.read_csv("Qing_Datasets/qing_silver_inflow.csv", delimiter="|")
df_silver_inflow.head(3)

Unnamed: 0,year,silver_inflow,silver_stock
0,1645,85,85
1,1646,155,240
2,1647,155,395


In [16]:
df_silver_inflow = df_silver_inflow.dropna().reset_index(drop=True)
df_silver_inflow.rename(columns= {"year": "year_from"}, inplace=True)

In [17]:
df_silver_inflow = add_polity_col(df_silver_inflow)
df_silver_inflow = add_year_to(df_silver_inflow)

In [18]:
# I will make two models out of these.
# The associated numbers will be integers.
# The units will be million taels.
df_silver_inflow.head(3)

Unnamed: 0,polity_id,year_from,year_to,silver_inflow,silver_stock
0,CnQingE,1645,1645,85,85
1,CnQingE,1646,1646,155,240
2,CnQingE,1647,1647,155,395


### Save the `CLEAN` file

In [19]:
# create separate dfs based on models in Django (tentative)
silver_data_dic = {}
for column in list(df_silver_inflow.columns):
    if column == "year_from" or column == "year_to" or column == "polity_id":
        continue
    df_key = "df_" + column
    df_value = df_silver_inflow[["polity_id", "year_from", "year_to", column]].copy()
    silver_data_dic[df_key] = df_value
    # CLEAN_CSV file saving:
    my_address = "Qing_Datasets_CLEAN/" + str(df_key) + "_CLEAN.csv"
    df_value.to_csv(my_address, index = False)

### Add 
- `silver_inflow`,
- `silver_stock`

### TO:
- `resulting_vars_dic`

In [20]:
silver_explanations = {
    "silver_inflow": "Silver inflow in Millions of silver taels??",
    "silver_stock": "Silver stock in Millions of silver taels??",
}

my_ints_dic_silver = {}
for col, col_exp in silver_explanations.items():
    my_ints_dic_silver[col] = {}
    my_ints_dic_silver[col][col] = {
        'var_exp': col_exp,
        'units': "Millions of silver taels??",
        'scale': 1000000,
    }

for df_key, df_value in silver_data_dic.items():
    var_name = df_key[3:]
    df_var = df_value
    my_ints_dic = my_ints_dic_silver[var_name]
    vars_dic_entry_maker(var_name, df_var,
                        my_ints_dic, {}, {}, {},
                         "Economy Variables", "State Finances",
                         notes = "Needs supervision on the units and scale.",
                         main_desc = silver_explanations[var_name],
                        )
my_ints_dic.keys()
#df_value

Hit me:  silver_inflow
Hit me:  silver_stock


dict_keys(['silver_stock'])

<hr style="margin-top:0px;height:18px;background-image: linear-gradient(to right, red,orange,green,blue,indigo,violet);">

# Total Population

In [21]:
from csv_cleaner_helpers import *

In [22]:
df_China_total_population = pd.read_csv("Qing_Datasets/ClioInfra_China_TotalPopulation.csv", delimiter="|")
df_China_total_population.head(3)

Unnamed: 0,Country Code,Country Name,Indicator,Year,Data
0,156,China,Total Population,1500,103000.0
1,156,China,Total Population,1501,
2,156,China,Total Population,1502,


In [23]:
# df_China_total_population = df_China_total_population.dropna(axis=0)
df_China_total_population = df_China_total_population.dropna().reset_index(drop=True)
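Since we want to know exactly what the null values are in each file, it may be worth counting them before dropping anything. A small sketch on the three rows shown above; `missing_per_column` and `cleaned` are just illustrative names.

```python
import pandas as pd

# The first three rows of the population table above (Year and Data only).
demo = pd.DataFrame({
    "Year": [1500, 1501, 1502],
    "Data": [103000.0, None, None],
})

# Count the missing values per column before they disappear.
missing_per_column = demo.isna().sum()

# Same cleaning step as in the cell above.
cleaned = demo.dropna().reset_index(drop=True)

print(missing_per_column.to_dict())
print(len(demo), "->", len(cleaned))
```

Dropping the nulls first is also what makes the later `astype("int")` safe, because a column that still contains NaN cannot be cast to a plain integer dtype.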

In [24]:
del df_China_total_population["Country Code"]
del df_China_total_population["Indicator"]
del df_China_total_population["Country Name"]

df_China_total_population.rename(columns= {"Year": "year_from",
                                          "Data": "total_population"}, inplace=True)

In [25]:
df_China_total_population["polity_id"] = df_China_total_population.apply(polity_id_decider_qing, axis=1)
# bring polity_id col to the front of the df:
backup = df_China_total_population["polity_id"]
df_China_total_population.drop(labels=["polity_id"], axis=1,inplace = True)
df_China_total_population.insert(0, 'polity_id', backup)

In [26]:
df_China_total_population.total_population = df_China_total_population.total_population.astype("int")

In [27]:
df_China_total_population["year_to"] = df_China_total_population["year_from"]
backup = df_China_total_population["year_to"]
df_China_total_population.drop(labels=["year_to"], axis=1,inplace = True)
df_China_total_population.insert(2, 'year_to', backup)
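The backup / drop / insert pattern in the last few cells is presumably what `move_col_to` from `csv_cleaner_helpers` wraps. A minimal sketch of such a helper, assuming it takes a column name and a target position; `move_col_to_sketch` is a hypothetical stand-in.

```python
import pandas as pd


def move_col_to_sketch(df: pd.DataFrame, col_name: str, position: int) -> pd.DataFrame:
    """Return a copy of df with `col_name` moved to the given column position."""
    out = df.copy()
    column = out.pop(col_name)          # removes the column and returns it
    out.insert(position, col_name, column)
    return out


demo = pd.DataFrame({
    "year_from": [1500],
    "total_population": [103000],
    "polity_id": ["CnQingE"],
})
moved = move_col_to_sketch(demo, "polity_id", 0)
print(list(moved.columns))
```

`DataFrame.pop` replaces the separate backup and drop steps, so the whole move is two lines.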

In [28]:
df_China_total_population.tail(3)

Unnamed: 0,polity_id,year_from,year_to,total_population
23,CnQingL,1980,1980,981861
24,CnQingL,1990,1990,1135185
25,CnQingL,2000,2000,1262645


### Save the `CLEAN` file

In [29]:
my_address = "Qing_Datasets_CLEAN/df_total_population_CLEAN.csv"
df_China_total_population.to_csv(my_address, index = False)

In [30]:
# This would replace all NaNs with None:
# df_China_total_population = df_China_total_population.where(pd.notnull(df_China_total_population), None)

### Add `total_population` to `resulting_vars_dic`

In [32]:
my_ints_dic = {
    'total_population': {
        'var_exp': "The total population of a country (or a polity).",
        'units': "People",
        'min': 0,
        'scale': 1000,
                },}

vars_dic_entry_maker("total_population", df_China_total_population,
                    my_ints_dic,{},{},{},
                    "Social Complexity Variables", "Social Scale",
                    notes = "Note that the population values are scaled.",
                    main_desc = "Total population or simply population, of a given area is the total number of people in that area at a given time.",
                    null_meaning = "The value is not available.")

Hit me:  total_population


In [33]:
# This is already in Zotero: 975BEGKF
df_China_total_population_meta = pd.read_csv("Qing_Datasets/ClioInfra_China_TotalPopulation_metadata.csv", delimiter="|")
df_China_total_population_meta

Unnamed: 0,Description,Value
0,Downloaded from,https://www.clio-infra.eu/IndicatorsPerCountry...
1,Text Citation,"Fink-Jensen, Jonathan (2015). Total Population..."
2,XML Citation,https://www.clio-infra.eu/Citations/DOI-10622_...
3,RIS Citation,https://www.clio-infra.eu/Citations/DOI-10622_...
4,BIB Citation,https://www.clio-infra.eu/Citations/DOI-10622_...


<hr style="margin-top:0px;height:18px;background-image: linear-gradient(to right, red,orange,green,blue,indigo,violet);">

In [34]:
from csv_cleaner_helpers import *
import pandas as pd

df_ClioInfra_China_GDPperCapita = pd.read_csv("Qing_Datasets/ClioInfra_China_GDPperCapita.csv", delimiter="|")
df_ClioInfra_China_GDPperCapita.head(4)

Unnamed: 0,Country Code,Country Name,Indicator,Year,Data
0,156,China,GDP per Capita,1500,1207.278
1,156,China,GDP per Capita,1501,
2,156,China,GDP per Capita,1502,
3,156,China,GDP per Capita,1503,


In [35]:
df_ClioInfra_China_GDPperCapita = df_ClioInfra_China_GDPperCapita.dropna().reset_index(drop=True)

del df_ClioInfra_China_GDPperCapita["Country Code"]
del df_ClioInfra_China_GDPperCapita["Indicator"]
del df_ClioInfra_China_GDPperCapita["Country Name"]

df_ClioInfra_China_GDPperCapita.rename(columns= {"Year": "year_from",
                                          "Data": "gdp_per_capita"}, inplace=True)

df_ClioInfra_China_GDPperCapita.gdp_per_capita = df_ClioInfra_China_GDPperCapita.gdp_per_capita.astype("float")
df_ClioInfra_China_GDPperCapita.year_from = df_ClioInfra_China_GDPperCapita.year_from.astype("int")

In [36]:
df_ClioInfra_China_GDPperCapita = add_polity_col(df_ClioInfra_China_GDPperCapita)
df_ClioInfra_China_GDPperCapita = add_year_to(df_ClioInfra_China_GDPperCapita)

In [37]:
df_ClioInfra_China_GDPperCapita.head(4)

Unnamed: 0,polity_id,year_from,year_to,gdp_per_capita
0,CnQingE,1500,1500,1207.278
1,CnQingE,1510,1510,1274.0
2,CnQingE,1520,1520,1166.0
3,CnQingE,1530,1530,1187.0


### Save the `CLEAN` file

In [38]:
my_address = "Qing_Datasets_CLEAN/df_gdp_per_capita_CLEAN.csv"
df_ClioInfra_China_GDPperCapita.to_csv(my_address, index = False)

### Add `gdp_per_capita` to `resulting_vars_dic`

In [41]:
my_floats_dic = {
    'gdp_per_capita': {
        'var_exp': "The Gross Domestic Product per capita, or GDP per capita, is a measure of a country's economic output that accounts for its number of people. It divides the country's gross domestic product by its total population.",
        'var_exp_source': "https://www.thebalance.com/gdp-per-capita-formula-u-s-compared-to-highest-and-lowest-3305848",
        'units': "Dollars (in 2009?)",
                },}

vars_dic_entry_maker("gdp_per_capita", df_ClioInfra_China_GDPperCapita,
                    {},my_floats_dic,{},{},
                    "Economy Variables", "Productivity",
                    notes = "The base year of the dollar values is not clear.",
                    main_desc = "The Gross Domestic Product per capita, or GDP per capita, is a measure of a country's economic output that accounts for its number of people. It divides the country's gross domestic product by its total population.",
                    main_desc_source = "https://www.thebalance.com/gdp-per-capita-formula-u-s-compared-to-highest-and-lowest-3305848",
                     null_meaning = "The value is not available.")

<hr style="margin-top:0px;height:18px;background-image: linear-gradient(to right, red,orange,green,blue,indigo,violet);">

In [42]:
from csv_cleaner_helpers import *
import pandas as pd

# the third one is the one we are using: MM6AEU7H
sources = ["https://www.ncei.noaa.gov/pub/data/paleo/historical/asia/china/reaches2020drought.txt",
           "https://www.ncei.noaa.gov/pub/data/paleo/historical/asia/china/reaches2020drought-severe.txt",
          "REACHES Chinese Historical Climate Database Qing Dynasty Drought Series https://www1.ncdc.noaa.gov/pub/data/paleo/historical/asia/china/reaches2020drought-category-sites.txt"]

df_REACHES_climate_data = pd.read_csv("Qing_Datasets/REACHES_climate_data.csv", delimiter="|")
df_REACHES_climate_data.head(2)

Unnamed: 0,Country,Year,drought,locust,Socioec.turmoil,crop.failure,famine,Source
0,China,1644.0,49.0,2.0,39.0,34.0,38.0,REACHES Chinese Historical Climate Database Qi...
1,China,1645.0,11.0,7.0,24.0,41.0,26.0,


In [43]:
df_REACHES_climate_data.columns = df_REACHES_climate_data.columns.str.lower()
df_REACHES_climate_data.rename(columns = {'year':'year_from',
                                         'socioec.turmoil':'socioeconomic_turmoil_events',
                                         'famine': 'famine_events',
                                         'crop.failure': "crop_failure_events",
                                         'locust': 'locust_events',
                                         'drought': "drought_events"}, inplace = True)
# regex=False so that "." is a literal dot and not the regex wildcard.
df_REACHES_climate_data.columns = df_REACHES_climate_data.columns.str.replace(".", "_", regex=False)
df_REACHES_climate_data = df_REACHES_climate_data.astype({'year_from':'int',
                                                        'drought_events': 'int',
                                                          'locust_events': 'int',
                                                          'crop_failure_events': 'int',
                                                          'socioeconomic_turmoil_events': 'int',
                                                          'famine_events': 'int',})
del df_REACHES_climate_data["country"]
del df_REACHES_climate_data["source"]
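A note on replacing `.` in column names: read as a regular expression, a dot matches any character, so it is safer to ask pandas for a literal replacement explicitly. The sketch below shows the difference on column names from the REACHES header above; Python's `re.sub` stands in for the regex behaviour, because how pandas treats single-character patterns has differed between versions.

```python
import re

import pandas as pd

columns = pd.Index(["socioec.turmoil", "crop.failure", "famine"])

# Literal replacement: only the dot characters change.
literal = columns.str.replace(".", "_", regex=False)

# The same pattern read as a regular expression matches every character.
as_regex = re.sub(".", "_", "crop.failure")

print(list(literal))
print(as_regex)
```

The same caution applies to the parentheses in the agricultural column names, which are also special characters in a regular expression.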

In [44]:
df_REACHES_climate_data = add_polity_col(df_REACHES_climate_data)
df_REACHES_climate_data = add_year_to(df_REACHES_climate_data)

In [45]:
df_REACHES_climate_data.head(2)

Unnamed: 0,polity_id,year_from,year_to,drought_events,locust_events,socioeconomic_turmoil_events,crop_failure_events,famine_events
0,CnQingE,1644,1644,49,2,39,34,38
1,CnQingE,1645,1645,11,7,24,41,26


### Save the `CLEAN` file
#### TO: `Qing_Datasets_CLEAN/`

In [46]:
# create separate dfs based on models in Django (tentative)
REACHES_climate_data_dic = {}
for column in list(df_REACHES_climate_data.columns):
    if column == "year_from" or column == "year_to" or column == "polity_id":
        continue
    df_key = "df_" + column
    df_value = df_REACHES_climate_data[["polity_id", "year_from", "year_to", column]].copy()
    REACHES_climate_data_dic[df_key] = df_value
    # CLEAN_CSV file saving:
    my_address = "Qing_Datasets_CLEAN/" + str(df_key) + "_CLEAN.csv"
    df_value.to_csv(my_address, index = False)

### Add 
- `famine_events`,
- `crop_failure_events`,
- `socioeconomic_turmoil_events`,
- `locust_events`,
- `drought_events`

### To
- `resulting_vars_dic`

In [47]:
events_explanations = {
    "famine_events": "number of geographic sites indicating famine",
    "crop_failure_events": "number of geographic sites indicating crop failure",
    "socioeconomic_turmoil_events": "number of geographic sites indicating socioeconomic turmoil",
    "locust_events": "number of geographic sites indicating locusts",
    "drought_events": "number of geographic sites indicating drought",
}

my_ints_dic_REACHES = {}
for col, col_exp in events_explanations.items():
    my_ints_dic_REACHES[col] = {}
    my_ints_dic_REACHES[col][col] = {
        'var_exp': col_exp,
        'units': "Numbers",
        'min': 0,}

for df_key, df_value in REACHES_climate_data_dic.items():
    var_name = df_key[3:]
    df_var = df_value
    my_ints_dic = my_ints_dic_REACHES[var_name]
    vars_dic_entry_maker(var_name, df_var,
                        my_ints_dic, {}, {}, {},
                         "Well Being", "Biological Well-Being",
                         notes = "Notes on the Variable",
                         main_desc = events_explanations[var_name],
                        main_desc_source = "https://www1.ncdc.noaa.gov/pub/data/paleo/historical/asia/china/reaches2020drought-category-sites.txt")
my_ints_dic.keys()
#df_value

Hit me:  drought_events
Hit me:  locust_events
Hit me:  socioeconomic_turmoil_events
Hit me:  crop_failure_events
Hit me:  famine_events


dict_keys(['famine_events'])

<hr style="margin-top:0px;height:18px;background-image: linear-gradient(to right, red,orange,green,blue,indigo,violet);">

In [48]:
from csv_cleaner_helpers import *
import pandas as pd

df_qing_disease = pd.read_csv("Qing_Datasets/qing_disease.csv", delimiter="|")
df_qing_disease.head(2)

Unnamed: 0,PolID,date.from,date.to,long,lat,elevation,Event,Variable,Main.Category,Sub.Category,Magnitude,Duration,Dataset
0,CnQingE,1644,1644,117.391278,39.011594,2.7,340200109.0,Disease Outbreak,34.0,Peculiar Epidemics,Uncertain,No description,REACHES Chinese Historical Climate Database
1,CnQingE,1644,1644,116.259384,38.478592,9.5,340200009.0,Disease Outbreak,34.0,Peculiar Epidemics,Uncertain,No description,REACHES Chinese Historical Climate Database


In [49]:
del df_qing_disease["Event"]
del df_qing_disease["Variable"]
del df_qing_disease["Main.Category"]
del df_qing_disease["Dataset"]

In [50]:
df_qing_disease.rename(columns = {'PolID':'polity_id', 
                                  'date.from':'year_from',
                                 'date.to': 'year_to',
                                 'long': 'longitude',
                                 'lat': 'latitude',
                                 'Sub.Category': 'sub_category',
                                 'Magnitude': 'magnitude',
                                 'Duration': 'duration',}, inplace = True)

In [51]:
df_qing_disease.head(2)

Unnamed: 0,polity_id,year_from,year_to,longitude,latitude,elevation,sub_category,magnitude,duration
0,CnQingE,1644,1644,117.391278,39.011594,2.7,Peculiar Epidemics,Uncertain,No description
1,CnQingE,1644,1644,116.259384,38.478592,9.5,Peculiar Epidemics,Uncertain,No description


### Save the `CLEAN` file

In [52]:
my_address = "Qing_Datasets_CLEAN/disease_outbreak_CLEAN.csv"
df_qing_disease.to_csv(my_address, index = False)

### Add `disease_outbreak` to `resulting_vars_dic`

In [53]:
# for each variable we need a set of four dictionaries
my_floats_dic = {
    'longitude': {
        'var_exp': "The longitude (in degrees) of the place where the disease was spread.",
        'units': "Degrees",
        'min': -180,
        'max': 180,
                },
    'latitude': {
        'var_exp': "The latitude (in degrees) of the place where the disease was spread.",
        'units': "Degrees",
        'min': -90,
        'max': 90,
                },
    'elevation': {
        'var_exp': "Elevation from mean sea level (in meters) of the place where the disease was spread.",
        'units': "Meters",
        'min': 0,
        'max': 5000,
                },
}

my_text_selects_dic = {
    'sub_category': {
        'var_exp': "The category of the disease.",
                },
    'magnitude': {
        'var_exp': "How severe the outbreak was.",
                },
    'duration': {
        'var_exp': "How long the outbreak lasted.",
                },
}

vars_dic_entry_maker("disease_outbreak", df_qing_disease,
                     {}, my_floats_dic, {}, my_text_selects_dic,
                    "Well Being", "Biological Well-Being",
                    main_desc = "A sudden increase in occurrences of a disease when cases are in excess of normal expectancy for the location or season.",
                    main_desc_source = "https://en.wikipedia.org/wiki/Disease_outbreak",
                     null_meaning = "The value is not available.")
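The `min` and `max` entries in the floats dictionary could also be used to sanity-check the CLEAN file before it goes anywhere near the database. A minimal sketch, assuming the bounds are meant as inclusive limits; `find_out_of_range` is a hypothetical helper, and the third row of the demo is a deliberately bad toy value.

```python
import pandas as pd


def find_out_of_range(df: pd.DataFrame, bounds: dict) -> dict:
    """Return {column: [row indexes]} for values outside the declared min/max."""
    problems = {}
    for column, spec in bounds.items():
        values = df[column]
        mask = pd.Series(False, index=df.index)
        if "min" in spec:
            mask |= values < spec["min"]
        if "max" in spec:
            mask |= values > spec["max"]
        if mask.any():
            problems[column] = mask[mask].index.tolist()
    return problems


# Geographic limits: longitude spans -180..180 and latitude spans -90..90.
bounds = {
    "longitude": {"min": -180, "max": 180},
    "latitude": {"min": -90, "max": 90},
}

# The first two rows come from the table above; the third is a toy row with a bad latitude.
demo = pd.DataFrame({
    "longitude": [117.391278, 116.259384, 116.0],
    "latitude": [39.011594, 38.478592, 95.0],
})
problems = find_out_of_range(demo, bounds)
print(problems)
```

Missing values never compare as out of range here, so nulls have to be checked separately.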

<hr style="margin-top:0px;height:45px;background-image: linear-gradient(to right, red,orange,green,blue,indigo,violet);">

In [None]:
dict_1 = {'John': 15, 'Rick': 10, 'Misa' : 12 }
dict_2 = {'Bonnie': 18,'Rick': 20,'Matt' : 16 }
dict_3 = {**dict_1,**dict_2}
print(dict_3)