# Feature Engineering
Feature engineering means taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.

We will cover a few common examples of feature engineering tasks: features for representing categorical data, features for representing text, and features for representing images. Additionally, we will discuss derived features for increasing model complexity and imputation of missing data. This process is often known as vectorization, as it involves converting arbitrary data into well-behaved vectors.

### Categorical Features
One common type of non-numerical data is categorical data.

In [1]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]
data

[{'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
 {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
 {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
 {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}]

In [3]:
# You might be tempted to encode this data with a straightforward numerical mapping:
# {'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}
# But a model would treat these numbers as algebraic quantities, implying an order
# such as Queen Anne < Fremont < Wallingford, or even that
# Wallingford - Queen Anne = Fremont, neither of which makes sense.

In this case, one proven technique is to use one-hot encoding, which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively. When your data comes as a list of dictionaries, Scikit-Learn's DictVectorizer will do this for you:

In [4]:
from sklearn.feature_extraction import DictVectorizer

In [32]:
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]], dtype=int32)

In [7]:
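# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
vec.get_feature_names()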

['neighborhood=Fremont',
 'neighborhood=Queen Anne',
 'neighborhood=Wallingford',
 'price',
 'rooms']
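
To make the encoding less of a black box, here is a minimal pure-Python sketch that reproduces the matrix and column names above. The helper `one_hot_records` is a hypothetical name introduced for illustration; it assumes that every string value is categorical and every other value is already numeric, and it is not how DictVectorizer is implemented internally.

```python
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'},
]


def one_hot_records(records):
    """Hypothetical helper: expand string fields into 'key=value' indicator columns."""
    # Collect column names: one per numeric key, one per (key, category) pair.
    names = set()
    for row in records:
        for key, value in row.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)

    # Fill one output row per record, defaulting every column to 0.
    matrix = []
    for row in records:
        encoded = dict.fromkeys(names, 0)
        for key, value in row.items():
            if isinstance(value, str):
                encoded[f"{key}={value}"] = 1   # category present
            else:
                encoded[key] = value            # numeric value passes through
        matrix.append([encoded[name] for name in names])
    return names, matrix


names, matrix = one_hot_records(data)
print(names)
for row in matrix:
    print(row)
```

Each neighborhood gets its own 0/1 column, so no artificial ordering between neighborhoods is introduced.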

In [11]:
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)

<4x5 sparse matrix of type '<class 'numpy.int32'>'
	with 12 stored elements in Compressed Sparse Row format>
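
The count of 12 stored elements follows from the dense output shown earlier: each of the 4 rows has exactly three nonzero entries (one neighborhood indicator, the price, and the room count). The sketch below illustrates the idea of keeping only nonzero values together with their positions. It uses a simplified coordinate-style list, which is an assumption made for readability and not the actual Compressed Sparse Row layout; `stored` is a hypothetical name.

```python
# Dense one-hot matrix copied from the DictVectorizer output above.
dense = [
    [0, 1, 0, 850000, 4],
    [1, 0, 0, 700000, 3],
    [0, 0, 1, 650000, 3],
    [1, 0, 0, 600000, 2],
]

# Keep only (row, column, value) triples for nonzero entries.
stored = [
    (i, j, value)
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
]

total_cells = len(dense) * len(dense[0])
print(len(stored), "stored elements out of", total_cells, "cells")
```

With only three neighborhoods the saving is small, but the zeros that one-hot encoding produces grow quickly as the number of categories increases.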

In [51]:
import pandas as pd
# A raw string avoids invalid escape sequences in the Windows path
data1 = pd.read_csv(r'C:\Users\POOJA\Downloads\python\Projects-Radical\Past Hires\PastHires.csv')
data1


Unnamed: 0,Years Experience,Employed?,Previous employers,Level of Education,Top-tier school,Interned,Hired
0,10,Y,4,BS,N,N,Y
1,0,N,0,BS,Y,Y,Y
2,7,N,6,BS,N,N,N
3,2,Y,1,MS,Y,N,Y
4,20,N,2,PhD,Y,N,N
5,0,N,0,PhD,Y,Y,Y
6,5,Y,2,MS,N,Y,Y
7,3,N,1,BS,N,Y,Y
8,15,Y,5,BS,N,N,Y
9,0,N,0,BS,N,N,N


In [47]:
# To apply DictVectorizer, we first convert the DataFrame to a list of dictionaries, one per row
data_dict = data1.to_dict(orient='records')
data_dict

[{'Years Experience': 10,
  'Employed?': 'Y',
  'Previous employers': 4,
  'Level of Education': 'BS',
  'Top-tier school': 'N',
  'Interned': 'N',
  'Hired': 'Y'},
 {'Years Experience': 0,
  'Employed?': 'N',
  'Previous employers': 0,
  'Level of Education': 'BS',
  'Top-tier school': 'Y',
  'Interned': 'Y',
  'Hired': 'Y'},
 {'Years Experience': 7,
  'Employed?': 'N',
  'Previous employers': 6,
  'Level of Education': 'BS',
  'Top-tier school': 'N',
  'Interned': 'N',
  'Hired': 'N'},
 {'Years Experience': 2,
  'Employed?': 'Y',
  'Previous employers': 1,
  'Level of Education': 'MS',
  'Top-tier school': 'Y',
  'Interned': 'N',
  'Hired': 'Y'},
 {'Years Experience': 20,
  'Employed?': 'N',
  'Previous employers': 2,
  'Level of Education': 'PhD',
  'Top-tier school': 'Y',
  'Interned': 'N',
  'Hired': 'N'},
 {'Years Experience': 0,
  'Employed?': 'N',
  'Previous employers': 0,
  'Level of Education': 'PhD',
  'Top-tier school': 'Y',
  'Interned': 'Y',
  'Hired': 'Y'},
 {'Years Exp

In [52]:
vec3 = DictVectorizer(sparse=False)
PastHire = vec3.fit_transform(data_dict)

In [70]:
print(vec3.get_feature_names())
pd.DataFrame(PastHire, columns=vec3.get_feature_names())

['Employed?=N', 'Employed?=Y', 'Hired=N', 'Hired=Y', 'Interned=N', 'Interned=Y', 'Level of Education=BS', 'Level of Education=MS', 'Level of Education=PhD', 'Previous employers', 'Top-tier school=N', 'Top-tier school=Y', 'Years Experience']


Unnamed: 0,Employed?=N,Employed?=Y,Hired=N,Hired=Y,Interned=N,Interned=Y,Level of Education=BS,Level of Education=MS,Level of Education=PhD,Previous employers,Top-tier school=N,Top-tier school=Y,Years Experience
0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,4.0,1.0,0.0,10.0
1,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,6.0,1.0,0.0,7.0
3,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,2.0
4,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,20.0
5,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
6,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,1.0,0.0,5.0
7,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,3.0
8,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,5.0,1.0,0.0,15.0
9,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


###  Text Features
Another common need in feature engineering is to convert text to a set of representative numerical values. 
One of the simplest methods of encoding data is by word counts: you take each snippet of text, count the occurrences of each word within it, and put the results in a table.

In [10]:
sample = ['problem of evil',
          'evil queen',
          'horizon problem']

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [34]:
vec1 = CountVectorizer()
x = vec1.fit_transform(sample)
x

<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

The result is a sparse matrix recording the number of times each word appears; it is easier to inspect if we convert this to a DataFrame with labeled columns:

In [37]:
import pandas as pd
pd.DataFrame(x.toarray(), columns=vec1.get_feature_names())

Unnamed: 0,evil,horizon,of,problem,queen
0,1,0,1,1,0
1,1,0,0,0,1
2,0,1,0,1,0
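
The same table can be reproduced by hand, which shows what the vectorizer is counting. This is a rough sketch using only the standard library; the `tokenize` helper is a hypothetical name, and its pattern (lowercase, runs of two or more word characters) is meant to mirror CountVectorizer's default settings rather than to replace them.

```python
import re
from collections import Counter

sample = ['problem of evil',
          'evil queen',
          'horizon problem']


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# One Counter of word occurrences per document.
counts = [Counter(tokenize(doc)) for doc in sample]

# The vocabulary is the sorted set of all words seen in any document.
vocabulary = sorted(set().union(*counts))

# A Counter returns 0 for missing words, which fills in the empty cells.
table = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in table:
    print(row)
```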


There are some issues with this approach, however: the raw word counts lead to features which put too much weight on words that appear very frequently, and this can be sub-optimal in some classification algorithms. One approach to fix this is known as term frequency-inverse document frequency (TF–IDF), which weights the word counts by a measure of how often they appear in the documents, so that words common to many documents count for less. The syntax for computing these features is similar to the previous example:

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec2 = TfidfVectorizer()
y = vec2.fit_transform(sample)
pd.DataFrame(y.toarray(), columns=vec2.get_feature_names())

Unnamed: 0,evil,horizon,of,problem,queen
0,0.517856,0.0,0.680919,0.517856,0.0
1,0.605349,0.0,0.0,0.0,0.795961
2,0.0,0.795961,0.0,0.605349,0.0
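
To see where these numbers come from, the weights can be recomputed by hand. The sketch below assumes the defaults of TfidfVectorizer: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit Euclidean length. It splits on whitespace for simplicity, which is adequate for this sample but not a general tokenizer.

```python
import math

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

docs = [doc.split() for doc in sample]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: rarer words get larger weights.
idf = {
    word: math.log((1 + n_docs) / (1 + sum(word in doc for doc in docs))) + 1
    for word in vocabulary
}

tfidf = []
for doc in docs:
    # Term frequency times idf for every vocabulary word.
    weights = [doc.count(word) * idf[word] for word in vocabulary]
    # Scale the row to unit Euclidean (L2) length.
    norm = math.sqrt(sum(w * w for w in weights))
    tfidf.append([w / norm for w in weights])

print(vocabulary)
for row in tfidf:
    print([round(w, 6) for w in row])
```

In the first document, "of" occurs in only one of the three documents, so it receives a larger weight (about 0.68) than "problem" and "evil" (about 0.52 each), which both occur in two.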


In [62]:
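# Iterating over a dict yields its keys, so this vectorizes the seven
# column names of the first record, not its values
e = vec1.fit_transform(data_dict[0])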
e

<7x13 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [64]:
pd.DataFrame(e.toarray(), columns=vec1.get_feature_names())

Unnamed: 0,education,employed,employers,experience,hired,interned,level,of,previous,school,tier,top,years
0,0,0,0,1,0,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,1,0,0,0,0
3,1,0,0,0,0,0,1,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,1,1,0
5,0,0,0,0,0,1,0,0,0,0,0,0,0
6,0,0,0,0,1,0,0,0,0,0,0,0,0


In [61]:
z = vec2.fit_transform(data_dict[0])
print(z)
pd.DataFrame(z.toarray(), columns=vec2.get_feature_names())

  (0, 12)	0.7071067811865476
  (0, 3)	0.7071067811865476
  (1, 1)	1.0
  (2, 8)	0.7071067811865476
  (2, 2)	0.7071067811865476
  (3, 6)	0.5773502691896257
  (3, 7)	0.5773502691896257
  (3, 0)	0.5773502691896257
  (4, 11)	0.5773502691896257
  (4, 10)	0.5773502691896257
  (4, 9)	0.5773502691896257
  (5, 5)	1.0
  (6, 4)	1.0


Unnamed: 0,education,employed,employers,experience,hired,interned,level,of,previous,school,tier,top,years
0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0
3,0.57735,0.0,0.0,0.0,0.0,0.0,0.57735,0.57735,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.57735,0.57735,0.0
5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
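
The 7 rows and 13 columns in the last two results come from the dictionary keys rather than the values: passing a single dict to a text vectorizer iterates over its keys, so each column name is treated as one document. A small standard-library sketch of that behavior, assuming the same tokenization rule as CountVectorizer's defaults (lowercase, runs of two or more word characters); `record` is the first row of the data typed out by hand.

```python
import re

# First record of the hiring data, as shown in the to_dict() output.
record = {
    'Years Experience': 10,
    'Employed?': 'Y',
    'Previous employers': 4,
    'Level of Education': 'BS',
    'Top-tier school': 'N',
    'Interned': 'N',
    'Hired': 'Y',
}

# Iterating over a dict yields its keys, so these become the "documents".
documents = list(record)

# Tokenize each key; punctuation such as '?' and '-' separates or drops out.
tokens = [re.findall(r"\b\w\w+\b", key.lower()) for key in documents]
vocabulary = sorted({token for row in tokens for token in row})

print(len(documents), "documents,", len(vocabulary), "distinct words")
print(vocabulary)
```

None of the values ('Y', 'BS', 10, and so on) appear in the vocabulary, which is why DictVectorizer, not a text vectorizer, is the appropriate tool for records like these.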
