# 2 Identifying variables

In [41]:
import pandas as pd
df = pd.read_csv('train.csv')

### Showing the types and values

In [None]:
df.info()

You can also check how many missing values a column has by chaining **isnull()** and **sum()**.

In [None]:
df.Fence.isnull().sum()
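
The same idea works for the whole DataFrame at once. The sketch below is only an illustration: `toy` is a small made-up stand-in for a few columns of train.csv, so the counts are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; the values are made up for illustration
toy = pd.DataFrame({
    'Fence': ['MnPrv', np.nan, np.nan, 'GdWo'],
    'LotFrontage': [65.0, 80.0, np.nan, 60.0],
    'MSZoning': ['RL', 'RL', 'RM', 'RL'],
})

# isnull() yields a True/False table; sum() counts the True values per column
missing_per_column = toy.isnull().sum()

# keep only the columns that have missing values, largest count first
missing_per_column = missing_per_column[missing_per_column > 0].sort_values(ascending=False)
print(missing_per_column)
```

Replacing `toy` with `df` should give the number of missing values for every column of the real data.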

### Describe the range of each feature

In [None]:
df.LotFrontage.max()

In [None]:
df.LotFrontage.min()

In [None]:
df.LotFrontage.describe()
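
To get the range of several numeric features in one go, one option is `agg` with a list of function names. This is a minimal sketch on made-up data; `toy` and its values are assumptions, not the contents of train.csv.

```python
import pandas as pd

# Hypothetical numeric columns; the values are made up for illustration
toy = pd.DataFrame({
    'LotFrontage': [65.0, 80.0, None, 60.0],
    'LotArea': [8450, 9600, 11250, 9550],
})

# one row per statistic, one column per feature; missing values are skipped
ranges = toy.agg(['min', 'max'])
print(ranges)
```

On the real data, something like `df[['LotFrontage', 'LotArea']].agg(['min', 'max'])` would follow the same pattern.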

### Unique values of a column
For categorical features, you can inspect which values occur in a column using `unique()`.

In [None]:
df.MSZoning.unique()

In [None]:
df.PavedDrive.unique()

We can also show the frequency of every value for a column.

In [None]:
df.PavedDrive.value_counts()
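
By default `value_counts()` leaves out missing values, which can hide how sparse a column is. The sketch below uses a made-up Series (an assumption, not real data) to show the `dropna` and `normalize` parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing values; the labels are made up for illustration
fence = pd.Series(['MnPrv', np.nan, np.nan, 'GdWo', 'MnPrv', np.nan], name='Fence')

# dropna=False adds a row for the missing values
counts = fence.value_counts(dropna=False)
print(counts)

# normalize=True returns fractions of the non-missing values instead of counts
shares = fence.value_counts(normalize=True)
print(shares)
```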

### Convert values
Often, it is easier to process your data as numbers. For instance, the feature PavedDrive has a categorical label, but if we want to use it in a regression algorithm we need to convert it to a number. In this case we convert it as follows: N=0, P=1, Y=2 (assuming P means something like Partial).

In [None]:
paved_drive = {'N':0, 'P':1, 'Y':2} # setup a dictionary to do the conversion

def convert_paved_drive(p):
    return paved_drive[p] # return the value of p in paved_drive dict

convert_paved_drive('P') # returns 1, the value stored under the key 'P' in the dictionary

We can use the **map()** method of a pandas Series to apply a function to every element of a column. Note that we could alternatively pass the dictionary paved_drive directly to map(), since Series.map() also accepts dictionaries.

In [None]:
df['PavedDriveN'] = df.PavedDrive.map(convert_paved_drive)

In [None]:
df.PavedDriveN[df.PavedDriveN < 2][:10]
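
Passing the dictionary directly behaves slightly differently from the lookup function: a label that is not in the dictionary becomes NaN instead of raising a KeyError. A minimal sketch, where the Series `drive` and the label 'X' are made up for illustration:

```python
import pandas as pd

paved_drive = {'N': 0, 'P': 1, 'Y': 2}

# 'X' is a made-up label that does not occur in the dictionary
drive = pd.Series(['Y', 'N', 'P', 'Y', 'X'])

# labels found in the dictionary are converted; unknown labels become NaN
mapped = drive.map(paved_drive)
print(mapped)
```

Because NaN is a float, the resulting column is stored as floats whenever a label could not be converted, so `mapped.isnull().sum()` is a quick check for labels the dictionary missed.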

#### Assignment: convert KitchenQual to a number. According to the description of the dataset, the labels mean:

|Label|description|
|:---|---|
|Ex|Excellent|
|Gd|Good|
|TA|Typical/Average|
|Fa|Fair|
|Po|Poor|

In [42]:
KitchenQual = {'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1, 'Po': 0} # higher number means better quality

def convert_kitchenQual(p):
    return KitchenQual[p] # look up the number for label p

df['KitchenQualN'] = df.KitchenQual.map(convert_kitchenQual)

df['KitchenQualN']

0       3
1       2
2       3
3       3
4       3
5       2
6       3
7       2
8       2
9       2
10      2
11      4
12      2
13      3
14      2
15      2
16      2
17      2
18      3
19      2
20      3
21      3
22      3
23      2
24      3
25      3
26      3
27      3
28      2
29      1
       ..
1430    2
1431    2
1432    2
1433    2
1434    2
1435    3
1436    2
1437    4
1438    2
1439    2
1440    2
1441    3
1442    4
1443    1
1444    3
1445    2
1446    2
1447    3
1448    2
1449    4
1450    2
1451    4
1452    2
1453    2
1454    3
1455    2
1456    2
1457    3
1458    3
1459    2
Name: KitchenQualN, Length: 1460, dtype: int64

### Add features

We can add new features to the DataFrame by simply assigning values to a new column name. In this example we will compute the sum of the 1st floor space and 2nd floor space.

In [None]:
# note: these columns must be accessed with [''] because in Python attribute names cannot start with a digit.
df['2FlrSF'] = df['1stFlrSF'] + df['2ndFlrSF']
df['2FlrSF'][:10]

In [None]:
# alternatively, give a list of columns.
df[['1stFlrSF', '2ndFlrSF', '2FlrSF']]

### Code book

Write a code book: a document in which you list some specifics of the collection, such as the number of samples, and for every variable a description, datatype, numeric/categorical, #missing values, the value range, and an example of a value. After the analysis, you can include the distribution of each variable and how the data was cleaned (missing values and outliers) and transformed. Include every operation done on the data to allow exact replication of these steps.

| variable | description | datatype | numeric/categorical | #missing | range | example value |
|--|--|--|--|--:|:-:|--|
| 1stFlrSF | First Floor square feet | int | Numeric | 0 | 334-4602 | 334 |
| 2FlrSF | Sum First Floor + Second Floor Square Feet | int | Numeric | 0 | 334-5642 | 5642 |
| PavedDriveN | State of driveway | int | Numeric | 0 | 0-2 (gravel/dirt, partially paved, paved) | 2 |
| PavedDrive | State of driveway | text | Categorical | 0 | N, P, Y (gravel/dirt, partially paved, paved) | N |
| BsmtQual | Height of the basement | text | Categorical | 37 | Ex, Gd, TA, Fa, Po, NA (Excellent 100+", Good 90-99", Typical 80-89", Fair 70-79", Poor <70", No Basement) | Ex |
| KitchenQual | Quality of kitchen | text | Categorical | 0 | Ex, Gd, TA, Fa, Po (Excellent, Good, Typical/Average, Fair, Poor); converted to 4, 3, 2, 1, 0 in KitchenQualN | Ex |

#### Assignment: Add KitchenQual to the Code Book
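
The mechanical columns of a code book (datatype, numeric/categorical, #missing, example value) can be computed rather than typed. The sketch below is one possible starting point, not a complete code book: `toy` is a made-up stand-in for the real DataFrame, and the description and range columns still have to be filled in by hand from the dataset description.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; the values are made up for illustration
toy = pd.DataFrame({
    '1stFlrSF': [856, 1262, 920],
    'PavedDrive': ['Y', 'N', 'Y'],
    'BsmtQual': ['Gd', np.nan, 'TA'],
})

# one row per variable, indexed by the column name
code_book = pd.DataFrame({
    'datatype': toy.dtypes.astype(str),
    '#missing': toy.isnull().sum(),
    # first non-missing value of each column serves as the example
    'example value': toy.apply(lambda col: col.dropna().iloc[0]),
})

# label each variable by whether pandas stores it as a number
code_book['numeric/categorical'] = [
    'Numeric' if pd.api.types.is_numeric_dtype(toy[c]) else 'Categorical'
    for c in toy.columns
]
print(code_book)
```

Note that a converted feature such as PavedDriveN would be labelled Numeric by this check even though it encodes categories, so the result still needs a manual review.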