# Validating Your Data

Data scientists spend most of their time preparing data, because raw data is seldom in a state that allows analysis. To prepare data for use, a data scientist must:

- Get the data
- Aggregate the data
- Create data subsets
- Clean the data
- Develop a single dataset by merging various datasets together

## Figuring out what’s in your data

Finding duplicates is important because, if you leave them in, you end up:

- Spending more computational time to process duplicates.
- Obtaining false results because duplicates implicitly overweight the results.

An example of how to find duplicates:

In [10]:
from lxml import objectify
import pandas as pd

xml = objectify.parse(open('XMLData2.xml'))
root = xml.getroot()
df = pd.DataFrame(columns=('Number', 'String', 'Boolean'))

for i in range(0,8):
    obj = root.getchildren()[i].getchildren()
    # zip, as discussed before, pairs up the items of two lists into tuples,
    # such as [('Number', '1'), ('String', 'First'), ('Boolean', 'True')]
    row = dict(zip(['Number', 'String', 'Boolean'], 
                   [obj[0].text, obj[1].text, 
                    obj[2].text]))
    
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)
    
# Returns a boolean Series: True for each row that repeats an earlier row,
# False otherwise
search = df.duplicated()

print df
print
print search
print search[search == True]

  Number  String Boolean
0      1   First    True
1      2  Second   False
2      3   Third    True
3      3   Third    True
4      3   Third    True
5      3   Third    True
6      4  Fourth   False
7      4  Fourth   False

0    False
1    False
2    False
3     True
4     True
5     True
6    False
7     True
dtype: bool
3    True
4    True
5    True
7    True
dtype: bool
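
To make the overweighting point concrete, here is a minimal sketch written for Python 3 and a recent pandas release (unlike the Python 2 cells in this notebook). The `sensor` and `value` columns are made-up illustration data, not part of `XMLData2.xml`.

```python
import pandas as pd

# Hypothetical readings; the row for sensor "b" was recorded three times.
df = pd.DataFrame({'sensor': ['a', 'b', 'b', 'b', 'c'],
                   'value': [10, 40, 40, 40, 10]})

# The repeated rows pull the mean toward 40.
mean_with_dupes = df['value'].mean()                       # 140 / 5 = 28.0
mean_without_dupes = df.drop_duplicates()['value'].mean()  # 60 / 3 = 20.0

# keep=False flags every member of a duplicated group, not just the repeats.
all_copies = df[df.duplicated(keep=False)]

print(mean_with_dupes, mean_without_dupes)
print(all_copies)
```

The `keep=False` option is handy when you want to inspect all copies of a repeated row before deciding which one to keep.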


## Removing duplicates

In [12]:
from lxml import objectify
import pandas as pd

xml = objectify.parse(open('XMLData2.xml'))
root = xml.getroot()
df = pd.DataFrame(columns=('Number', 'String', 'Boolean'))

for i in range(0,8):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'String', 'Boolean'], 
                   [obj[0].text, obj[1].text, 
                    obj[2].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)
    
print df
print df.drop_duplicates()

  Number  String Boolean
0      1   First    True
1      2  Second   False
2      3   Third    True
3      3   Third    True
4      3   Third    True
5      3   Third    True
6      4  Fourth   False
7      4  Fourth   False
  Number  String Boolean
0      1   First    True
1      2  Second   False
2      3   Third    True
6      4  Fourth   False


# Manipulating Categorical Variables

## Creating categorical variables

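Categorical variables take their values from a fixed, known set, which makes them valuable for a number of data science tasks. For example, imagine trying to find values that are out of range in a huge dataset. In the following code, any entry that is not one of the declared categories is stored as missing (NaN), so a check for nulls locates it.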

In [21]:
import pandas as pd

print pd.__version__

car_colors = pd.Series(['Blue', 'Red', 'Green'], dtype='category')

car_data = pd.Series(
    pd.Categorical(['Yellow', 'Green', 'Red', 'Blue', 'Purple'],
                   categories=list(car_colors), ordered=False))

find_entries = pd.isnull(car_data)

print car_colors
print
print car_data
print
print find_entries[find_entries == True]

0.16.2
0     Blue
1      Red
2    Green
dtype: category
Categories (3, object): [Blue, Green, Red]

0      NaN
1    Green
2      Red
3     Blue
4      NaN
dtype: category
Categories (3, object): [Blue, Red, Green]

0    True
4    True
dtype: bool
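
The cell above shows where the unexpected entries are, but the original values are lost once they become NaN. The following sketch (Python 3, recent pandas) checks membership first so the offending values can be reported. The names `allowed` and `observed` are illustrative, and this is one possible approach rather than the only one.

```python
import pandas as pd

allowed = ['Blue', 'Red', 'Green']
observed = pd.Series(['Yellow', 'Green', 'Red', 'Blue', 'Purple'])

# Flag entries that are not among the allowed categories.
is_unexpected = ~observed.isin(allowed)
out_of_range = observed[is_unexpected]

# Convert only the valid entries to a categorical with the allowed levels.
clean = observed[~is_unexpected].astype(pd.CategoricalDtype(categories=allowed))

print(out_of_range.tolist())  # ['Yellow', 'Purple']
print(clean.tolist())         # ['Green', 'Red', 'Blue']
```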


## Renaming levels

Sometimes the names of the categories you use are inconvenient or otherwise wrong for a particular need. You can rename the categories as needed:

In [12]:
import pandas as pd

car_colors = pd.Series(['Blue', 'Red', 'Green'], 
                       dtype='category')
car_data = pd.Series(
    pd.Categorical(
        ['Blue', 'Green', 'Red', 'Blue', 'Red'],
        categories=list(car_colors), ordered=False))


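# Assigning to .cat.categories renames by position: the first existing
# category becomes "Purple", the second "Yellow", the third "Mauve".
# car_colors stores its categories sorted (Blue, Green, Red), so Blue becomes
# Purple, Green becomes Yellow, and Red becomes Mauve.
car_colors.cat.categories = ["Purple", "Yellow", "Mauve"]
# car_data keeps the order Blue, Red, Green, which lines up with the renamed
# values of car_colors (Purple, Mauve, Yellow).
car_data.cat.categories = list(car_colors)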

print car_colors
print car_data

0    Purple
1     Mauve
2    Yellow
dtype: category
Categories (3, object): [Purple, Yellow, Mauve]
0    Purple
1    Yellow
2     Mauve
3    Purple
4     Mauve
dtype: category
Categories (3, object): [Purple, Mauve, Yellow]
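
Renaming by position depends on the order in which the categories happen to be stored, and to my knowledge recent pandas releases no longer accept direct assignment to `.cat.categories`. A sketch of the same rename with `rename_categories` and an explicit mapping, written for Python 3 and a recent pandas release:

```python
import pandas as pd

car_data = pd.Series(
    pd.Categorical(['Blue', 'Green', 'Red', 'Blue', 'Red'],
                   categories=['Blue', 'Red', 'Green']))

# An explicit old-name -> new-name mapping avoids depending on category order.
renamed = car_data.cat.rename_categories(
    {'Blue': 'Purple', 'Red': 'Mauve', 'Green': 'Yellow'})

print(renamed.tolist())  # ['Purple', 'Yellow', 'Mauve', 'Purple', 'Mauve']
```

`rename_categories` returns a new Series and leaves `car_data` unchanged.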


## Combining levels

A particular categorical level might contain too few observations to support meaningful analysis. In this case, combining several small categories might offer better analysis results.

In [20]:
import pandas as pd

car_colors = pd.Series(['Blue', 'Red', 'Green'], 
                       dtype='category')
car_data = pd.Series(
    pd.Categorical(
        ['Blue', 'Green', 'Red', 'Green', 'Red', 'Green'],
        categories=car_colors, ordered=False))

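# Rename the "Blue" level to "Blue_Red" ...
car_data.cat.categories = ["Blue_Red", "Red", "Green"]

# ... then move every "Red" entry into it. The "Red" level stays in the
# category list but no longer has any entries. (.loc is used here because
# .ix was deprecated and removed in later pandas releases.)
car_data.loc[car_data.isin(['Red'])] = 'Blue_Red'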

print
print car_data


0    Blue_Red
1       Green
2    Blue_Red
3       Green
4    Blue_Red
5       Green
dtype: category
Categories (3, object): [Blue_Red, Red, Green]
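
An alternative sketch for Python 3 and a recent pandas release: map each old level to its combined level and rebuild the categorical. The name `merge_map` is hypothetical. Unlike the cell above, this leaves no unused `Red` level behind.

```python
import pandas as pd

car_data = pd.Series(
    pd.Categorical(['Blue', 'Green', 'Red', 'Green', 'Red', 'Green'],
                   categories=['Blue', 'Red', 'Green']))

# Hypothetical mapping: fold the two rarer colors into a single level.
merge_map = {'Blue': 'Blue_Red', 'Red': 'Blue_Red', 'Green': 'Green'}

# Work on plain Python objects, apply the mapping, then convert back.
combined = car_data.astype(object).map(merge_map).astype('category')

print(combined.tolist())
print(list(combined.cat.categories))  # ['Blue_Red', 'Green']
```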


# Dealing with Dates in Your Data

## Formatting time values

Obtaining the correct date and time representation can make analysis a lot easier. For example, you often have to change the representation to get values to sort correctly. Python provides two common ways to format a date and time. The first is to call str(), which turns a datetime value into a string in a fixed default layout that you cannot adjust. The second, strftime(), requires more work because you must supply a string of special directives that define how the datetime value should appear after conversion. You can find a listing of these directives at http://strftime.org

In [1]:
import datetime as dt

now = dt.datetime.now()

print str(now)
print now.strftime('%a, %d %B %Y')

2015-09-05 11:12:02.874385
Sat, 05 September 2015
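
The remark about sorting can be illustrated with a short Python 3 sketch. The three dates are arbitrary examples. Strings sort character by character, so a day-first layout orders by day of month, while a year-first layout sorts chronologically.

```python
import datetime as dt

dates = [dt.datetime(2015, 9, 5),
         dt.datetime(2014, 12, 1),
         dt.datetime(2015, 1, 20)]

# Day-first strings sort by day of month, which is not chronological.
day_first = sorted(d.strftime('%d/%m/%Y') for d in dates)

# Year-month-day strings sort in the same order as the dates themselves.
year_first = sorted(d.strftime('%Y-%m-%d') for d in dates)

print(day_first)   # ['01/12/2014', '05/09/2015', '20/01/2015']
print(year_first)  # ['2014-12-01', '2015-01-20', '2015-09-05']
```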


## Using the right time transformation

Time zone differences in local time can cause all sorts of problems when performing analysis. For that matter, some types of calculations simply require a time shift in order to get the right results.

In [10]:
import datetime as dt

now = dt.datetime.now()
timevalue = now + dt.timedelta(hours=2)

print now.strftime('%H:%M:%S')
print timevalue.strftime('%H:%M:%S')
print timevalue - now

18:00:10
20:00:10
2:00:00
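
Adding a `timedelta` shifts the clock reading but leaves the value unaware of any time zone. As a sketch of the time zone case, the Python 3 standard library offers `datetime.timezone` for fixed offsets. The two-hour offset and the timestamp below are arbitrary assumptions, and a fixed offset does not model daylight saving rules.

```python
import datetime as dt

# A fixed, time-zone-aware instant expressed in UTC.
utc_time = dt.datetime(2015, 9, 5, 16, 0, 10, tzinfo=dt.timezone.utc)

# Hypothetical zone that is always two hours ahead of UTC.
plus_two = dt.timezone(dt.timedelta(hours=2))
local_time = utc_time.astimezone(plus_two)

print(utc_time.strftime('%H:%M:%S %z'))    # 16:00:10 +0000
print(local_time.strftime('%H:%M:%S %z'))  # 18:00:10 +0200
print(local_time == utc_time)              # True: same instant, different wall clock
```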


# Dealing with Missing Data

## Finding missing data

It's essential to find missing data in your dataset to avoid getting incorrect results from your analysis. The following code shows how you can obtain a listing of missing values without too much effort.

In [2]:
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, np.nan, 5, 6, None])

print s.isnull()

print
print s[s.isnull()]

0    False
1    False
2    False
3     True
4    False
5    False
6     True
dtype: bool

3   NaN
6   NaN
dtype: float64
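
The same check extends to a table with several columns. A minimal sketch for Python 3 and a recent pandas release, using a made-up two-column DataFrame (`age` and `city` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps in different places.
df = pd.DataFrame({'age': [25, np.nan, 31, 40],
                   'city': ['Oslo', 'Lima', None, None]})

# Count the missing entries in each column.
missing_per_column = df.isnull().sum()

# Select the rows that have at least one missing entry.
rows_with_gaps = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```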


## Encoding missingness

After you figure out that your dataset is missing information, you need to consider what to do about it. The three possibilities are to ignore the issue, fill in the missing items, or remove (drop) the missing entries from the dataset. Ignoring the issue can silently distort your results, so in most cases it is not the right approach.

In [3]:
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, np.nan, 5, 6, None])

print s.fillna(int(s.mean()))
print
print s.dropna()

0    1
1    2
2    3
3    3
4    5
5    6
6    3
dtype: float64

0    1
1    2
2    3
4    5
5    6
dtype: float64
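
The cell above truncates the mean (3.4) to 3 before filling. The following Python 3 sketch compares that with two other fill values you might choose, the untruncated mean and the median. Which one is appropriate depends on the data, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 5, 6, None])

filled_mean = s.fillna(s.mean())      # mean of 1, 2, 3, 5, 6 is 3.4
filled_median = s.fillna(s.median())  # median of 1, 2, 3, 5, 6 is 3.0
dropped = s.dropna()                  # five entries remain

print(filled_mean.tolist())
print(filled_median.tolist())
print(len(s), len(dropped))
```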


# Slicing and Dicing

## Slicing rows

In [14]:
import numpy as np

x = np.array([[[1, 2, 3],  [4, 5, 6],  [7, 8, 9],],
              [[11,12,13], [14,15,16], [17,18,19],],
              [[21,22,23], [24,25,26], [27,28,29]]])

x[1]

array([[11, 12, 13],
       [14, 15, 16],
       [17, 18, 19]])

## Slicing columns

In [15]:
x = np.array([[[1, 2, 3],  [4, 5, 6],  [7, 8, 9],],
              [[11,12,13], [14,15,16], [17,18,19],],
              [[21,22,23], [24,25,26], [27,28,29]]])

x[:,1]

array([[ 4,  5,  6],
       [14, 15, 16],
       [24, 25, 26]])

## Dicing

In [16]:
x = np.array([[[1, 2, 3],  [4, 5, 6],  [7, 8, 9],],
              [[11,12,13], [14,15,16], [17,18,19],],
              [[21,22,23], [24,25,26], [27,28,29]]])

print x[1,1]
print x[:,1,1]
print x[1,:,1]
print
print x[1:3, 1:3]

[14 15 16]
[ 5 15 25]
[12 15 18]

[[[14 15 16]
  [17 18 19]]

 [[24 25 26]
  [27 28 29]]]
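
Because `x` has three axes, it can help to check which axis each index expression acts on. The following Python 3 sketch prints the shape of each result. It also shows `x[:, :, 1]`, which picks position 1 along the last axis and is not used in the cells above.

```python
import numpy as np

x = np.array([[[1, 2, 3],  [4, 5, 6],  [7, 8, 9]],
              [[11, 12, 13], [14, 15, 16], [17, 18, 19]],
              [[21, 22, 23], [24, 25, 26], [27, 28, 29]]])

print(x.shape)             # (3, 3, 3)
print(x[1].shape)          # (3, 3): a single index on axis 0 removes that axis
print(x[:, 1].shape)       # (3, 3): a single index on axis 1
print(x[:, :, 1].shape)    # (3, 3): a single index on axis 2
print(x[1:3, 1:3].shape)   # (2, 2, 3): ranges keep their axes

# Position 1 along the last axis of every row.
print(x[:, :, 1])
```

A single integer index removes an axis, whereas a range such as `1:3` keeps it, which is why the last dicing example above returns a three-dimensional result.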


# Concatenating and Transforming

## Adding new cases and variables

In [21]:
import pandas as pd

df = pd.DataFrame({'A': [2,3,1],
                   'B': [1,2,3],
                   'C': [5,3,4]})

print df
print

df1 = pd.DataFrame({'A': [4],
                    'B': [4],
                    'C': [4]})

print df1
print

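# pd.concat stacks the rows of df1 under df (DataFrame.append was removed
# in later pandas releases); the original index labels are kept
df = pd.concat([df, df1])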

print df
print

# Reset the index to 0..n-1; drop=True discards the old index instead of
# adding it as a new column
df = df.reset_index(drop=True)
print df

df.loc[df.last_valid_index() + 1] = [5, 5, 5]
print
print df

df2 = pd.DataFrame({'D': [1, 2, 3, 4, 5]})

df = df.join(df2)
print
print df

   A  B  C
0  2  1  5
1  3  2  3
2  1  3  4

   A  B  C
0  4  4  4

   A  B  C
0  2  1  5
1  3  2  3
2  1  3  4
0  4  4  4

   A  B  C
0  2  1  5
1  3  2  3
2  1  3  4
3  4  4  4

   A  B  C
0  2  1  5
1  3  2  3
2  1  3  4
3  4  4  4
4  5  5  5

   A  B  C  D
0  2  1  5  1
1  3  2  3  2
2  1  3  4  3
3  4  4  4  4
4  5  5  5  5
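
On a recent pandas release, `pd.concat` can cover both steps: adding cases (rows) and adding variables (columns). A minimal Python 3 sketch that reaches the same final table as above. The names `new_rows`, `new_column`, `taller`, and `wider` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 1], 'B': [1, 2, 3], 'C': [5, 3, 4]})
new_rows = pd.DataFrame({'A': [4, 5], 'B': [4, 5], 'C': [4, 5]})
new_column = pd.DataFrame({'D': [1, 2, 3, 4, 5]})

# ignore_index=True renumbers the rows 0..n-1, so no reset_index is needed.
taller = pd.concat([df, new_rows], ignore_index=True)

# axis=1 places the frames side by side, aligning on the row index.
wider = pd.concat([taller, new_column], axis=1)

print(wider)
```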


## Removing data

In [24]:
import pandas as pd

df = pd.DataFrame({'A': [2,3,1],
                   'B': [1,2,3],
                   'C': [5,3,4]})

print df
print

df = df.drop(df.index[[1]])
print df

# Drop column B; axis=1 selects columns rather than rows
df = df.drop('B', axis=1)
print
print df

   A  B  C
0  2  1  5
1  3  2  3
2  1  3  4

   A  B  C
0  2  1  5
2  1  3  4

   A  C
0  2  5
2  1  4
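
Recent pandas releases also accept `index=` and `columns=` keywords on `drop`, which can read more clearly than an axis number. A short Python 3 sketch of the same removal (`trimmed` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 1], 'B': [1, 2, 3], 'C': [5, 3, 4]})

# Drop the row labeled 1 and the column named 'B' in a single call.
trimmed = df.drop(index=1, columns='B')

print(trimmed)
```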


## Sorting and shuffling

In [29]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2,1,2,3,3,5,4],
                   'B': [1,2,3,5,4,2,5],
                   'C': [5,3,4,1,1,2,3]})

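# sort_index(by=...) is the pandas 0.16 spelling; pandas 0.17 and later
# provide df.sort_values(by=['A', 'B']) for the same purpose
df = df.sort_index(by=['A', 'B'], ascending=[True, True])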
print df
print

df = df.reset_index(drop=True)
print df
print

index = df.index.tolist()
np.random.shuffle(index)
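# .loc selects the rows by label in the shuffled order (.ix was deprecated
# and removed in later pandas releases)
df = df.loc[index]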
print df
print
df = df.reset_index(drop=True)
print
print df

   A  B  C
1  1  2  3
0  2  1  5
2  2  3  4
4  3  4  1
3  3  5  1
6  4  5  3
5  5  2  2

   A  B  C
0  1  2  3
1  2  1  5
2  2  3  4
3  3  4  1
4  3  5  1
5  4  5  3
6  5  2  2

   A  B  C
4  3  5  1
1  2  1  5
2  2  3  4
6  5  2  2
3  3  4  1
5  4  5  3
0  1  2  3


   A  B  C
0  3  5  1
1  2  1  5
2  2  3  4
3  5  2  2
4  3  4  1
5  4  5  3
6  1  2  3
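
A sketch of the same sort and shuffle for Python 3 and a recent pandas release, using `sort_values` and `sample`. The `random_state` value is an arbitrary seed chosen only to make the shuffle repeatable.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 2, 3, 3, 5, 4],
                   'B': [1, 2, 3, 5, 4, 2, 5],
                   'C': [5, 3, 4, 1, 1, 2, 3]})

# Sort by A, breaking ties with B, then renumber the rows.
ordered = df.sort_values(by=['A', 'B']).reset_index(drop=True)

# frac=1 draws every row exactly once, which amounts to a shuffle.
shuffled = ordered.sample(frac=1, random_state=0).reset_index(drop=True)

print(ordered)
print(shuffled)
```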


# Aggregating Data at Any Level

In [32]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Map': [0,0,0,1,1,2,2],
                   'Values': [1,2,3,5,4,2,5]})

print df
print


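# transform computes one value per group and repeats it on every row of
# that group, so the result lines up with the original DataFrame
df['S'] = df.groupby('Map')['Values'].transform(np.sum)
df['M'] = df.groupby('Map')['Values'].transform(np.mean)
# V in the output is the sample variance (n - 1 denominator): the values
# 1, 2, 3 in group 0 give 1.0
df['V'] = df.groupby('Map')['Values'].transform(np.var)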

print df

   Map  Values
0    0       1
1    0       2
2    0       3
3    1       5
4    1       4
5    2       2
6    2       5

   Map  Values  S    M    V
0    0       1  6  2.0  1.0
1    0       2  6  2.0  1.0
2    0       3  6  2.0  1.0
3    1       5  9  4.5  0.5
4    1       4  9  4.5  0.5
5    2       2  7  3.5  4.5
6    2       5  7  3.5  4.5
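
When you want one summary row per group instead of a value repeated on every row, `agg` is the counterpart to `transform`. A minimal sketch for Python 3 and a recent pandas release, using string names for the aggregations (`summary` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'Map': [0, 0, 0, 1, 1, 2, 2],
                   'Values': [1, 2, 3, 5, 4, 2, 5]})

# One row per Map value, one column per aggregation.
summary = df.groupby('Map')['Values'].agg(['sum', 'mean', 'var'])

print(summary)
```

The `var` column here is the sample variance, matching the `V` column in the output above.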
