# Machine Learning Methods: Features

## A feature is an individual measurable property or characteristic of a data set.

## We're going to generate some synthetic data and demonstrate the features of the dataset!

In [36]:
## First let's import the libraries we will be using!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Now something new from sklearn: make_blobs, which helps us create our data!
from sklearn.datasets import make_blobs

In [5]:
## make_blobs gives us the ability to create some synthetic data
## We can define the number of samples we want, so 100
## We can also define the number of categories using the 'centers' parameter

X,y = make_blobs(n_samples = 100, centers = 3)
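
The call above can be sketched with two more arguments spelled out. `n_features=2` is the default in make_blobs, which is why X ends up with two columns. `random_state=0` is an assumption added here for reproducibility; the original cell does not set a seed, so the values printed in this notebook will not match this sketch.

```python
import numpy as np
from sklearn.datasets import make_blobs

# n_features=2 is the make_blobs default; random_state=0 is an assumed seed
# (not in the original cell) so repeated runs return the same points.
X_demo, y_demo = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)

print(X_demo.shape)       # (100, 2): 100 samples, 2 features
print(y_demo.shape)       # (100,): one cluster label per sample
print(np.unique(y_demo))  # the distinct cluster labels
```

Each row of `X_demo` is one sample, each column is one feature, and `y_demo` holds the category each sample was drawn from.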

In [31]:
## Let's show off the shape and type of X below

X.shape

(100, 2)

In [15]:
type(X)

numpy.ndarray

In [23]:
## Now let's view the first row of X
## Each row holds the 2 features of one sample
## The first column is feature 1 and the second column is feature 2
## Row 0 holds both values for what we would call Sample 1

X[0]

array([-6.60077708, -3.42333109])

In [25]:
## Similar to the above, row 1 holds features 1 and 2 for Sample 2

X[1]

array([-6.37924037, -2.79020101])

In [29]:
## We can also slice several rows at once. X[0:9] returns rows 0 through 8 (the stop index is excluded), so we see features 1 and 2 for the first 9 samples

X[0:9]

array([[-6.60077708, -3.42333109],
       [-6.37924037, -2.79020101],
       [-0.02947559,  8.7289871 ],
       [ 7.26867915,  1.86416641],
       [ 6.76008494,  3.7120952 ],
       [ 7.43264756,  2.49907768],
       [ 8.56565558, -1.33261361],
       [-6.61085153, -3.5601748 ],
       [-8.33555395, -3.82939368]])
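
Here is a small sketch of how row and column slicing behaves on an array shaped like X. The array `X_demo` is a hypothetical stand-in filled with consecutive numbers, not the make_blobs output.

```python
import numpy as np

# Hypothetical stand-in for X: 100 samples, 2 features.
X_demo = np.arange(200, dtype=float).reshape(100, 2)

first_nine = X_demo[0:9]  # rows 0..8; the stop index 9 is excluded
first_ten = X_demo[:10]   # rows 0..9, i.e. the first 10 samples
feature_1 = X_demo[:, 0]  # feature 1 for every sample
feature_2 = X_demo[:, 1]  # feature 2 for every sample

print(first_nine.shape)  # (9, 2)
print(first_ten.shape)   # (10, 2)
print(feature_1.shape)   # (100,)
```

Slicing rows selects samples, while slicing columns selects features.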

## Now let's try with the iris dataset from seaborn

In [38]:
## Let's import the iris dataset

A = sns.load_dataset('iris')

In [44]:
## Looking at this data we can see there are 5 columns in this dataframe, starting with sepal_length, which is feature 1. The four measurements are numeric features; species is categorical and is usually treated as the label to predict
## To elaborate further, for the first sample the 3rd feature (petal_length) is 1.4

A.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
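
A minimal sketch of separating the measurement columns from the species column, which many workflows treat as the label rather than as an input feature. The five rows shown above are typed in by hand, so the name `iris_head` is an illustrative stand-in and no download is needed.

```python
import pandas as pd

# The five rows from A.head(), entered by hand for a self-contained example.
iris_head = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

features = iris_head.drop(columns="species")  # the four numeric measurements
labels = iris_head["species"]                 # what a classifier would predict

print(features.shape)        # (5, 4)
print(features.iloc[0, 2])   # 1.4, the 3rd feature (petal_length) of the first sample
```

The same two lines work on the full dataframe `A` once it is loaded.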


## Now let's try again with some new data specifically made for machine learning testing! We'll use data from the UCI Machine Learning Repository

In [51]:
## First let's import the data from Bias_correction_ucl.csv
## This is regression data, which will be covered in more detail in the regression lessons

df = pd.read_csv('Bias_correction_ucl.csv')

In [55]:
## As you can see there is a large number of features in this dataset! There are 25 columns in total, so up to 25 features in this dataframe (in a regression setting, Next_Tmax and Next_Tmin would be the values to predict)

df.head()

|   | station | Date | Present_Tmax | Present_Tmin | LDAPS_RHmin | LDAPS_RHmax | LDAPS_Tmax_lapse | LDAPS_Tmin_lapse | LDAPS_WS | LDAPS_LH | ... | LDAPS_PPT2 | LDAPS_PPT3 | LDAPS_PPT4 | lat | lon | DEM | Slope | Solar radiation | Next_Tmax | Next_Tmin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 30/06/2013 | 28.7 | 21.4 | 58.255688 | 91.116364 | 28.074101 | 23.006936 | 6.818887 | 69.451805 | ... | 0.0 | 0.0 | 0.0 | 37.6046 | 126.991 | 212.335 | 2.785 | 5992.895996 | 29.1 | 21.2 |
| 1 | 2.0 | 30/06/2013 | 31.9 | 21.6 | 52.263397 | 90.604721 | 29.850689 | 24.035009 | 5.69189 | 51.937448 | ... | 0.0 | 0.0 | 0.0 | 37.6046 | 127.032 | 44.7624 | 0.5141 | 5869.3125 | 30.5 | 22.5 |
| 2 | 3.0 | 30/06/2013 | 31.6 | 23.3 | 48.690479 | 83.973587 | 30.091292 | 24.565633 | 6.138224 | 20.57305 | ... | 0.0 | 0.0 | 0.0 | 37.5776 | 127.058 | 33.3068 | 0.2661 | 5863.555664 | 31.1 | 23.9 |
| 3 | 4.0 | 30/06/2013 | 32.0 | 23.4 | 58.239788 | 96.483688 | 29.704629 | 23.326177 | 5.65005 | 65.727144 | ... | 0.0 | 0.0 | 0.0 | 37.645 | 127.022 | 45.716 | 2.5348 | 5856.964844 | 31.7 | 24.3 |
| 4 | 5.0 | 30/06/2013 | 31.4 | 21.9 | 56.174095 | 90.155128 | 29.113934 | 23.48648 | 5.735004 | 107.965535 | ... | 0.0 | 0.0 | 0.0 | 37.5507 | 127.135 | 35.038 | 0.5055 | 5859.552246 | 31.2 | 22.5 |


In [57]:
## Checking the shape of the dataframe we can see that there are 7752 samples (rows) and 25 columns

df.shape

(7752, 25)
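
As a rough sketch, the shape tuple can be unpacked directly, and the columns can be split into an input matrix and a target matrix. The frame `df_demo` is a hypothetical miniature built from a few columns of the first three rows shown above, and treating `Next_Tmax` and `Next_Tmin` as the targets is an assumption made for this example.

```python
import pandas as pd

# Hypothetical mini-frame: a few columns from the first three rows of df.head().
df_demo = pd.DataFrame({
    "station": [1.0, 2.0, 3.0],
    "Present_Tmax": [28.7, 31.9, 31.6],
    "Present_Tmin": [21.4, 21.6, 23.3],
    "Next_Tmax": [29.1, 30.5, 31.1],
    "Next_Tmin": [21.2, 22.5, 23.9],
})

# Rows are samples, columns are candidate features.
n_samples, n_columns = df_demo.shape

# Assumed targets for this sketch; everything else is kept as an input.
target_cols = ["Next_Tmax", "Next_Tmin"]
X_demo = df_demo.drop(columns=target_cols).to_numpy()
y_demo = df_demo[target_cols].to_numpy()

print(n_samples, n_columns)        # 3 5
print(X_demo.shape, y_demo.shape)  # (3, 3) (3, 2)
```

After the split, `X_demo` has the same samples-by-features layout as the array returned by make_blobs.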