# Data Wrangling -> Preparing Training Data

Our training data comes from two Spider dataset files, 'train_spider.json' and 'train_others.json'.

The first was built by the team that prepared this data and the second is a collection based on common public datasets.

Each record in these files pairs a training question with its schema ID (stored under the key 'db_id'), the correct query that answers the question, and some pre-parsed data for the question and query.
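To make the record layout concrete, here is a minimal sketch that parses one made-up record and lists its keys. The record values (`example_db`, the question, the query) are hypothetical, and `question_toks` is an assumed name for the pre-parsed question field; only `question`, `db_id`, and `query` are keys this notebook actually reads.

```python
import json

# Hypothetical record: the values are invented for illustration, and
# "question_toks" is an assumed name for the pre-tokenized question field.
sample_json = """
[
  {
    "db_id": "example_db",
    "question": "How many rows are there?",
    "question_toks": ["How", "many", "rows", "are", "there", "?"],
    "query": "SELECT count(*) FROM example_table"
  }
]
"""

records = json.loads(sample_json)
first = records[0]

# List the fields available on a record before deciding what to keep
print(sorted(first.keys()))

# The three fields this notebook pulls into the DataFrame
kept = [first["question"], first["db_id"], first["query"]]
print(kept)
```

Running the same `sorted(...keys())` check on a real record is a quick way to confirm which fields the files actually contain.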

The files already include the tokenized words for each question, but I want the practice, so I'll tokenize the questions myself. I also haven't thought of a good use for the query information (I wouldn't know how to train on it), but I'll pull it in anyway so that one version of the data holds the full question, target schema, and target query. I'll then train on only the first two.
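As a rough idea of what tokenizing the questions by hand could look like, here is a minimal sketch using a regular expression. The function name `tokenize_question` and the choice to lowercase and split off punctuation are assumptions for illustration, not the approach used later in this project.

```python
import re
from typing import List


def tokenize_question(question: str) -> List[str]:
    """Lowercase the question and split it into word and punctuation tokens."""
    # \w+ matches runs of letters, digits, and underscores;
    # [^\w\s] matches any single punctuation character.
    return re.findall(r"\w+|[^\w\s]", question.lower())


tokens = tokenize_question("what is the biggest city in wyoming")
print(tokens)
# ['what', 'is', 'the', 'biggest', 'city', 'in', 'wyoming']
```

A dedicated NLP tokenizer would handle contractions and numbers more carefully; the regex version is only meant to show the shape of the step.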

In [1]:
import json
import pandas as pd

## Spider Training Data

### Load json

In [2]:
path = '../data/raw/spider/'

with open(path+'train_spider.json', "r") as f:
    spi_train = json.load(f)

### Create Pandas DF
With question, corresponding schema, and query

In [3]:
# Keep only the question, schema ID, and query from each record
spi_train_list = [[rec['question'], rec['db_id'], rec['query']] for rec in spi_train]

In [4]:
spi_train_df = pd.DataFrame(spi_train_list, columns=['question', 'schema', 'query'])

In [5]:
spi_train_df.head()

|   | question | schema | query |
|---|----------|--------|-------|
| 0 | How many heads of the departments are older th... | department_management | SELECT count(*) FROM head WHERE age > 56 |
| 1 | List the name, born state and age of the heads... | department_management | SELECT name , born_state , age FROM head ORD... |
| 2 | List the creation year, name and budget of eac... | department_management | SELECT creation , name , budget_in_billions ... |
| 3 | What are the maximum and minimum budget of the... | department_management | SELECT max(budget_in_billions) , min(budget_i... |
| 4 | What is the average number of employees of the... | department_management | SELECT avg(num_employees) FROM department WHER... |


## Other Training Data

### Load json

In [6]:
path2 = '../data/raw/spider/'

with open(path2+'train_others.json', "r") as f:
    oth_train = json.load(f)

### Create Pandas DF

With question, corresponding schema, and query

In [7]:
# Keep only the question, schema ID, and query from each record
oth_train_list = [[rec['question'], rec['db_id'], rec['query']] for rec in oth_train]

In [8]:
oth_train_df = pd.DataFrame(oth_train_list, columns=['question', 'schema','query'])

In [11]:
oth_train_df.head()

|   | question | schema | query |
|---|----------|--------|-------|
| 0 | what is the biggest city in wyoming | geo | SELECT city_name FROM city WHERE population =... |
| 1 | what wyoming city has the largest population | geo | SELECT city_name FROM city WHERE population =... |
| 2 | what is the largest city in wyoming | geo | SELECT city_name FROM city WHERE population =... |
| 3 | where is the most populated area of wyoming | geo | SELECT city_name FROM city WHERE population =... |
| 4 | which city in wyoming has the largest population | geo | SELECT city_name FROM city WHERE population =... |
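Both files go through the same load-and-select steps, so the pattern could be folded into one small helper. The sketch below is one possible way to do that; `records_to_frame` and the demo records are hypothetical names and values, not part of the original notebook.

```python
import pandas as pd


def records_to_frame(records):
    """Build a question/schema/query DataFrame from a list of record dicts."""
    frame = pd.DataFrame(records)
    # Keep only the three fields of interest, in a fixed order
    frame = frame[["question", "db_id", "query"]]
    # Rename db_id so the column matches the name used in this notebook
    return frame.rename(columns={"db_id": "schema"})


# Hypothetical records standing in for the parsed JSON
demo_records = [
    {"question": "q1", "db_id": "db_a", "query": "SELECT 1", "extra": "ignored"},
    {"question": "q2", "db_id": "db_b", "query": "SELECT 2", "extra": "ignored"},
]

demo_df = records_to_frame(demo_records)
print(demo_df)
```

With a helper like this, each file would need only a `json.load` call followed by `records_to_frame(...)`.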


## Combine Dataframes

In [14]:
full_train = pd.concat([spi_train_df, oth_train_df], axis=0, ignore_index=True)

In [24]:
# Did the concat work? Let's check the counts
print(spi_train_df['question'].count())
print(oth_train_df['question'].count())
print(full_train['question'].count())

7000
1659
8659


That looks good: 7000 + 1659 = 8659. I'll now prep for the next step by saving the full data to the interim directory, then create a DataFrame with just the first two columns for training and save it to the processed directory.
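The same count check can be written as assertions so it fails loudly instead of relying on reading the printed numbers. This is a minimal sketch on two small made-up frames (`left` and `right` are hypothetical stand-ins for the two training DataFrames).

```python
import pandas as pd

# Hypothetical stand-ins for the two training DataFrames
left = pd.DataFrame({"question": ["q1", "q2"], "schema": ["a", "a"]})
right = pd.DataFrame({"question": ["q3"], "schema": ["b"]})

combined = pd.concat([left, right], axis=0, ignore_index=True)

# Row count of the result should equal the sum of the inputs
assert len(combined) == len(left) + len(right)

# ignore_index=True should produce a fresh 0..n-1 index with no repeats
assert combined.index.is_unique
assert list(combined.index) == [0, 1, 2]

print(len(combined))
```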

## Prep For Next Steps

### Save full data to the interim folder

In [25]:
# Commented out after running so I don't re-run unnecessarily

#filepath = '../data/interim/full_training_data.csv'
#full_train.to_csv(filepath, index=False)
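An alternative to commenting the export out is to skip the write when the file already exists. The sketch below shows one way that could look; `save_once` is a hypothetical helper, and the demo writes to a temporary directory rather than the project's data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_once(frame, filepath):
    """Write frame to CSV only if the file is missing. Return True if written."""
    target = Path(filepath)
    if target.exists():
        return False
    # Create the parent folder if needed before writing
    target.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(target, index=False)
    return True


# Hypothetical data standing in for the full training DataFrame
demo = pd.DataFrame({"question": ["q1"], "schema": ["a"], "query": ["SELECT 1"]})

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "interim" / "full_training_data.csv"
    first_write = save_once(demo, out)   # file is missing, so it is written
    second_write = save_once(demo, out)  # file exists now, so it is skipped

print(first_write, second_write)
```

The trade-off is that a stale file is never refreshed automatically, so this only suits outputs that are expensive to rebuild and rarely change.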

### Create Simpler Training File and Save to Processed Folder

In [26]:
# Create a new DataFrame with just the first two columns
training_data = full_train[['question','schema']]

In [29]:
#look to make sure things copied correctly
print(training_data.head())
print('--'*50)
print(training_data.info())

                                            question                 schema
0  How many heads of the departments are older th...  department_management
1  List the name, born state and age of the heads...  department_management
2  List the creation year, name and budget of eac...  department_management
3  What are the maximum and minimum budget of the...  department_management
4  What is the average number of employees of the...  department_management
----------------------------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8659 entries, 0 to 8658
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  8659 non-null   object
 1   schema    8659 non-null   object
dtypes: object(2)
memory usage: 135.4+ KB
None


In [30]:
#export csv
filepath2 = '../data/processed/training_data.csv'
training_data.to_csv(filepath2, index=False)