## Capstone Exercise

* Import the Python libraries needed for the exercise: NumPy for numerical operations and pandas for working with DataFrames.

* Import the CSV file – NSMES1988.csv into a dataframe.

* Inspect the data and report what you find: number of rows and columns, data types, and so on (use multiple functions).

* Find out if the data is clean or if the data has missing values.

* Comment on the data types, their values and their range, specifically on age and income columns.

* Export the data to a JSON file named NSMES1988.json, view the result, and add your comments.

* Analyze the DataFrame's memory usage and recommend ways to reduce it.

* Apply the recommendations by changing data types.

In [43]:
import numpy as np
import pandas as pd

In [44]:
df = pd.read_csv('NSMES1988.csv')
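The `Unnamed: 0` column that shows up in the outputs of this notebook suggests that the first CSV column is a saved row index. A minimal sketch, assuming that is the case, of reading it as the index with `index_col=0`; the inline CSV and the name `sample` are hypothetical stand-ins for NSMES1988.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for NSMES1988.csv;
# the unnamed first column mimics a row index saved by an earlier export.
sample_csv = io.StringIO(
    ",visits,age,income\n"
    "1,5,6.9,2.881\n"
    "2,1,7.4,2.7478\n"
    "3,13,6.6,0.6532\n"
)

# index_col=0 uses the first column as the index instead of
# loading it as a data column named "Unnamed: 0".
sample = pd.read_csv(sample_csv, index_col=0)

print(sample.shape)           # (rows, columns)
print(list(sample.columns))   # column names without the index column
```

Whether the column should be dropped depends on whether the original row numbers are needed later.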

In [45]:
# Display detailed information about each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4406 entries, 0 to 4405
Data columns (total 20 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  4406 non-null   int64  
 1   visits      4406 non-null   int64  
 2   nvisits     4406 non-null   int64  
 3   ovisits     4406 non-null   int64  
 4   novisits    4406 non-null   int64  
 5   emergency   4406 non-null   int64  
 6   hospital    4406 non-null   int64  
 7   health      4406 non-null   object 
 8   chronic     4406 non-null   int64  
 9   adl         4406 non-null   object 
 10  region      4406 non-null   object 
 11  age         4406 non-null   float64
 12  afam        4406 non-null   object 
 13  gender      4406 non-null   object 
 14  married     4406 non-null   object 
 15  school      4406 non-null   int64  
 16  income      4406 non-null   float64
 17  employed    4406 non-null   object 
 18  insurance   4406 non-null   object 
 19  medicaid    4406 non-null   object 

In [46]:
df.isna().sum()
# No missing values in any column (invalid values are a separate check)

Unnamed: 0    0
visits        0
nvisits       0
ovisits       0
novisits      0
emergency     0
hospital      0
health        0
chronic       0
adl           0
region        0
age           0
afam          0
gender        0
married       0
school        0
income        0
employed      0
insurance     0
medicaid      0
dtype: int64
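The per-column counts above can be condensed into a single number when the table has many columns. A minimal sketch on a hypothetical sample (the values and the name `sample` are made up for illustration, with one gap added on purpose):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value, for illustration only.
sample = pd.DataFrame({
    "age": [6.9, 7.4, np.nan],
    "income": [2.881, -1.0125, 0.6532],
})

# Missing cells across the whole frame.
total_missing = int(sample.isna().sum().sum())
# Rows that contain at least one missing cell.
rows_with_missing = int(sample.isna().any(axis=1).sum())

print(total_missing, rows_with_missing)
```

Note that `isna()` only flags NaN and None; an implausible value such as a negative income passes this check and has to be found separately.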

In [47]:
df.dtypes

Unnamed: 0      int64
visits          int64
nvisits         int64
ovisits         int64
novisits        int64
emergency       int64
hospital        int64
health         object
chronic         int64
adl            object
region         object
age           float64
afam           object
gender         object
married        object
school          int64
income        float64
employed       object
insurance      object
medicaid       object
dtype: object
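The dtype listing above shows many `object` and `int64` columns, which is where the memory tasks in the exercise would usually start. A minimal sketch, under the assumption that the text columns hold a few repeated labels and the counts are small; the sample data and the names `sample` and `optimized` are hypothetical, and the savings on the real file may differ.

```python
import pandas as pd

# Hypothetical sample mirroring a few NSMES1988 columns.
sample = pd.DataFrame({
    "visits": [5, 1, 13, 16, 3] * 200,
    "health": ["average", "average", "poor", "poor", "average"] * 200,
    "medicaid": ["no", "no", "yes", "no", "no"] * 200,
})

# deep=True also counts the strings themselves, not only the pointers to them.
before = int(sample.memory_usage(deep=True).sum())

optimized = sample.copy()
# Repeated string labels are stored once when the column is categorical.
for col in ["health", "medicaid"]:
    optimized[col] = optimized[col].astype("category")
# downcast="integer" picks the smallest integer type that holds the values.
optimized["visits"] = pd.to_numeric(optimized["visits"], downcast="integer")

after = int(optimized.memory_usage(deep=True).sum())
print(f"{before} bytes -> {after} bytes")
```

On the real DataFrame, `df.info(memory_usage="deep")` reports the same deep total alongside the column listing.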

In [48]:
df.head()

Unnamed: 0.1,Unnamed: 0,visits,nvisits,ovisits,novisits,emergency,hospital,health,chronic,adl,region,age,afam,gender,married,school,income,employed,insurance,medicaid
0,1,5,0,0,0,0,1,average,2,normal,other,6.9,yes,male,yes,6,2.881,yes,yes,no
1,2,1,0,2,0,2,0,average,2,normal,other,7.4,no,female,yes,10,2.7478,no,yes,no
2,3,13,0,0,0,3,3,poor,4,limited,other,6.6,yes,female,no,10,0.6532,no,no,yes
3,4,16,0,5,0,1,1,poor,2,limited,other,7.6,no,male,yes,3,0.6588,no,yes,no
4,5,3,0,0,0,0,0,average,2,limited,other,7.9,no,female,yes,6,0.6588,no,yes,no


In [49]:
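# age is stored in decades (e.g. 6.9); convert to whole years.
# Round before casting so floating-point error cannot truncate a year.
df['age'] = (df['age'] * 10).round().astype('int16')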


In [50]:
df.head()

Unnamed: 0.1,Unnamed: 0,visits,nvisits,ovisits,novisits,emergency,hospital,health,chronic,adl,region,age,afam,gender,married,school,income,employed,insurance,medicaid
0,1,5,0,0,0,0,1,average,2,normal,other,69,yes,male,yes,6,2.881,yes,yes,no
1,2,1,0,2,0,2,0,average,2,normal,other,74,no,female,yes,10,2.7478,no,yes,no
2,3,13,0,0,0,3,3,poor,4,limited,other,66,yes,female,no,10,0.6532,no,no,yes
3,4,16,0,5,0,1,1,poor,2,limited,other,76,no,male,yes,3,0.6588,no,yes,no
4,5,3,0,0,0,0,0,average,2,limited,other,79,no,female,yes,6,0.6588,no,yes,no


In [53]:
df[['age', 'income']].describe()

| | age | income |
|---|---|---|
| count | 4406.0 | 4406.0 |
| mean | 74.024058 | 2.527132 |
| std | 6.33405 | 2.924648 |
| min | 66.0 | -1.0125 |
| 25% | 69.0 | 0.91215 |
| 50% | 73.0 | 1.69815 |
| 75% | 78.0 | 3.17285 |
| max | 109.0 | 54.8351 |


# Age

* The raw `age` values were floats expressed in decades (for example 6.9), so they were multiplied by 10 and stored as integers. After conversion, ages range from 66 to 109 with a mean of about 74 years.

# Income

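* The float type is appropriate for `income`, but the decimal point appears to be misplaced: values range from -1.0125 to 54.8351, which suggests the column is recorded in scaled units rather than in single dollars.
* Negative incomes should not exist; the minimum of -1.0125 shows that at least one row needs to be reviewed or corrected.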
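The two income observations can be checked in code. A minimal sketch on hypothetical values taken from the range reported by `describe()`; the scale factor `UNIT_DOLLARS` is an assumption that should be confirmed against the dataset's documentation, and masking is only one possible way to treat the negative values.

```python
import pandas as pd

# Hypothetical values within the range reported by describe().
income = pd.Series([2.881, 0.6532, -1.0125, 54.8351], name="income")

# Flag and count the negative entries.
negative = income < 0
print(f"negative incomes: {int(negative.sum())}")

# One possible treatment: mark negatives as missing rather than dropping rows.
cleaned = income.mask(negative)

# ASSUMPTION: one unit equals 10,000 dollars; verify before relying on it.
UNIT_DOLLARS = 10_000
income_dollars = (cleaned * UNIT_DOLLARS).round(2)
print(income_dollars.tolist())
```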


In [55]:
df.to_json("NSMES1988.json")
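To view and comment on the exported file, it helps to know how `to_json` lays out the data. A minimal sketch comparing the default layout with `orient="records"` on a hypothetical two-row sample (the name `sample` and its values are illustrative only):

```python
import io
import json
import pandas as pd

# Hypothetical two-row sample standing in for the full DataFrame.
sample = pd.DataFrame({"age": [69, 74], "gender": ["male", "female"]})

# Default layout: one object per column, keyed by row label.
by_column = sample.to_json()
# orient="records": one object per row, often easier to read.
by_row = sample.to_json(orient="records", indent=2)

print(by_column)
print(by_row)

# Read the JSON back to confirm that the values survived the export.
restored = pd.read_json(io.StringIO(by_row), orient="records")
print(restored)
```

The default layout repeats every row label under every column, so the file tends to be larger than the CSV; the records layout drops the row labels entirely.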