# Importing, Exploring, and Manipulating Data

![SegmentLocal](https://media.giphy.com/media/xT9C25UNTwfZuk85WP/giphy.gif "segment")

## Intro to Pandas

![SegmentLocal](https://media2.giphy.com/media/EatwJZRUIv41G/giphy.gif "segment")

Pandas is a package that makes Python a better and more efficient tool for representing and working with tabular data.

Built on top of the NumPy package (discussed previously!)

### Perks!! 

Good for cleaning data, because it lets you insert and delete columns in DataFrames and higher-dimensional objects.

Data alignment can be explicit or automatic: you can align objects to a set of labels yourself, or let Pandas align them for you during computations.

Easy handling of missing data!

Intelligent and quick label-based indexing and subsetting of large data sets.

![SegmentLocal](https://media2.giphy.com/media/4527jA8ErzDAYkJ9cf/giphy.gif "segment")

In [1]:
#Import it:
import numpy as np
import pandas as pd

Pandas has two primary data structures: Series and DataFrame.

### Series

A Series is one-dimensional. One common way to create one is from a list:

In [2]:
series = pd.Series([1, 2, 3])
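
A list is only one possible starting point. As a minimal sketch (the labels `a`, `b`, `c` and the variable names `labeled` and `from_dict` are made up for illustration), a Series can also carry custom index labels or be built from a dict:

```python
import pandas as pd

# A Series built from a list gets a default integer index (0, 1, 2).
series = pd.Series([1, 2, 3])

# Passing `index` attaches custom labels (these labels are invented).
labeled = pd.Series([1, 2, 3], index=["a", "b", "c"])

# A dict also works: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

print(labeled["b"])               # label-based lookup
print(labeled.equals(from_dict))  # same labels and values either way
```

The index is what makes the label-based lookups mentioned under Perks possible.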

### DataFrame 

A DataFrame is two-dimensional. It can be created from a NumPy array (Example 1) or from a dict of column values (Example 2).

Example 1:

In [3]:
dates = pd.date_range('20130101', periods=6)
# Creates ['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
#               '2013-01-05', '2013-01-06'],

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
#  Creates          A         B         C         D
#2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
#2013-01-02  1.212112 -0.173215  0.119209 -1.044236
#2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
#2013-01-04  0.721555 -0.706771 -1.039575  0.271860
#2013-01-05 -0.424972  0.567020  0.276232 -1.087401
#2013-01-06 -0.673690  0.113648 -1.478427  0.524988



credit : https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

Example 2:

In [4]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
# Creates:
#     A          B    C  D      E    F
#0  1.0 2013-01-02  1.0  3   test  foo
#1  1.0 2013-01-02  1.0  3  train  foo
#2  1.0 2013-01-02  1.0  3   test  foo
#3  1.0 2013-01-02  1.0  3  train  foo


credit : https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
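
The two examples above boil down to two construction styles. Here is a small sketch of both side by side, with invented column names and values (`homes` and `grid` are hypothetical names, not part of the original examples):

```python
import numpy as np
import pandas as pd

# From a dict of equal-length lists: the keys become column names.
homes = pd.DataFrame({"bedrooms": [2, 3, 4],
                      "price": [250000.0, 310000.0, 420000.0]})

# From a 2-D NumPy array: you supply the column names yourself.
grid = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

print(homes.shape)   # (rows, columns)
print(homes.dtypes)  # one data type per column
print(grid)
```

Unlike a NumPy array, each column of a DataFrame can have its own data type, which is why `df2` above can mix floats, timestamps, categories, and strings.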

### Viewing Data

To view the top rows of the data:

In [5]:
df2.head()

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


To view the bottom rows of the data:

In [6]:
df2.tail()

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


Both methods take an optional parameter: the number of rows you want to view. With no argument they show 5 rows, which is why all 4 rows of `df2` appear in both outputs above.
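
A quick sketch of the row-count parameter on a made-up 10-row frame (`numbers` is a hypothetical name used only here):

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})  # toy frame with 10 rows

first_three = numbers.head(3)   # rows 0, 1, 2
last_two = numbers.tail(2)      # rows 8, 9
default_head = numbers.head()   # no argument: first 5 rows

print(first_three)
print(last_two)
```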

## Importing and Exploring Data

Import data from a spreadsheet saved in .csv format using `read_csv()`. We're going to be using a data set that has information about the King County, Washington housing market (the zip codes and coordinates in the data are in the Seattle area). It contains info about various attributes of a house (number of bedrooms, number of floors, square footage of the living space, year built, etc.) and the price of each house.

![SegmentLocal](https://media0.giphy.com/media/TncmRRvEGVoVcHgaAb/giphy.gif?cid=790b7611667c6e58e8528ed47768c66e5ce2440012b719c9&rid=giphy.gif&ct=g "segment")

In [7]:
import pandas as pd
import numpy as np

data = pd.read_csv("../housing_data.csv")
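
If you don't have the file handy, here is a minimal sketch of what `read_csv()` does, using an in-memory string in place of a file. The two rows are copied from the housing output shown in this section; the names `csv_text` and `sample` and the date conversion step are assumptions added for illustration:

```python
import io
import pandas as pd

# Stand-in for a file on disk: a header line plus two data rows.
csv_text = """id,date,price,bedrooms
7129300520,20141013T000000,221900.0,3
6414100192,20141209T000000,538000.0,3
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# Optional extra step: turn the date strings into real timestamps.
sample["date"] = pd.to_datetime(sample["date"], format="%Y%m%dT%H%M%S")

print(sample)
```

The first line of the file becomes the column names, and Pandas infers a data type for each column from its values.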

As covered earlier, `.head()` and `.tail()` can be used to view the top and bottom rows of much bigger datasets stored in a DataFrame, too:

In [8]:
data.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [9]:
data.tail()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21608 | 263000018 | 20140521T000000 | 360000.0 | 3 | 2.5 | 1530 | 1131 | 3.0 | 0 | 0 | ... | 8 | 1530 | 0 | 2009 | 0 | 98103 | 47.6993 | -122.346 | 1530 | 1509 |
| 21609 | 6600060120 | 20150223T000000 | 400000.0 | 4 | 2.5 | 2310 | 5813 | 2.0 | 0 | 0 | ... | 8 | 2310 | 0 | 2014 | 0 | 98146 | 47.5107 | -122.362 | 1830 | 7200 |
| 21610 | 1523300141 | 20140623T000000 | 402101.0 | 2 | 0.75 | 1020 | 1350 | 2.0 | 0 | 0 | ... | 7 | 1020 | 0 | 2009 | 0 | 98144 | 47.5944 | -122.299 | 1020 | 2007 |
| 21611 | 291310100 | 20150116T000000 | 400000.0 | 3 | 2.5 | 1600 | 2388 | 2.0 | 0 | 0 | ... | 8 | 1600 | 0 | 2004 | 0 | 98027 | 47.5345 | -122.069 | 1410 | 1287 |
| 21612 | 1523300157 | 20141015T000000 | 325000.0 | 2 | 0.75 | 1020 | 1076 | 2.0 | 0 | 0 | ... | 7 | 1020 | 0 | 2008 | 0 | 98144 | 47.5941 | -122.299 | 1020 | 1357 |


For a summary of the entire DataFrame, use `.info()`. This tells you the column names, how many non-null values each column holds (so you can see whether any values are missing), and the data type of the values in each column.

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
id               21613 non-null int64
date             21613 non-null object
price            21613 non-null float64
bedrooms         21613 non-null int64
bathrooms        21613 non-null float64
sqft_living      21613 non-null int64
sqft_lot         21613 non-null int64
floors           21613 non-null float64
waterfront       21613 non-null int64
view             21613 non-null int64
condition        21613 non-null int64
grade            21613 non-null int64
sqft_above       21613 non-null int64
sqft_basement    21613 non-null int64
yr_built         21613 non-null int64
yr_renovated     21613 non-null int64
zipcode          21613 non-null int64
lat              21613 non-null float64
long             21613 non-null float64
sqft_living15    21613 non-null int64
sqft_lot15       21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
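
The individual pieces of the `.info()` summary are also available on their own. A minimal sketch on a made-up frame with one missing value (the frame `toy` and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing price.
toy = pd.DataFrame({"price": [221900.0, np.nan, 180000.0],
                    "bedrooms": [3, 3, 2]})

print(toy.shape)            # (number of rows, number of columns)
print(toy.dtypes)           # data type of each column
print(toy.notnull().sum())  # non-null count per column, like the counts in .info()
```

A column whose non-null count is lower than the number of rows has missing values.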


## Manipulating Data

Once you have data in Pandas, there are various commands that let you explore and manipulate it further. For example, you can filter your data for certain conditions using `.loc[]` (an indexer, so it takes square brackets rather than parentheses). To filter for houses with 2 bedrooms and 2 bathrooms and see their prices, use:

In [11]:
data.loc[(data["bedrooms"]==2) & (data["bathrooms"]==2), ["bedrooms","bathrooms","price"]]

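| | bedrooms | bathrooms | price |
|---|---|---|---|
| 226 | 2 | 2.0 | 479950.0 |
| 255 | 2 | 2.0 | 592500.0 |
| 438 | 2 | 2.0 | 438000.0 |
| 470 | 2 | 2.0 | 290900.0 |
| 487 | 2 | 2.0 | 207950.0 |
| 525 | 2 | 2.0 | 727500.0 |
| 537 | 2 | 2.0 | 595000.0 |
| 547 | 2 | 2.0 | 259950.0 |
| 640 | 2 | 2.0 | 378000.0 |
| 773 | 2 | 2.0 | 450000.0 |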
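
To see how the filter is put together, here is a sketch that breaks it into steps on a tiny made-up frame (`toy`, `mask`, and `matches` are hypothetical names, and the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({"bedrooms": [2, 3, 2, 2],
                    "bathrooms": [2.0, 2.0, 1.0, 2.0],
                    "price": [479950.0, 538000.0, 180000.0, 592500.0]})

# Each comparison yields a boolean Series. Wrap each one in parentheses
# and combine them with & (and) or | (or).
mask = (toy["bedrooms"] == 2) & (toy["bathrooms"] == 2)

# First argument selects rows, second selects columns.
matches = toy.loc[mask, ["bedrooms", "bathrooms", "price"]]

print(mask)
print(matches)
```

Note that the matching rows keep their original index labels, which is why the index in the housing output above jumps from 226 to 255 and so on.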


You can also run a computation over the data with the `apply()` method. For example, to apply the `numpy` `sum` function to each column in the data set, pass `np.sum` to `apply` and set the `axis` parameter to 0, which hands the function one column at a time (use `axis = 1` to hand it one row at a time instead):

In [12]:
data.apply(np.sum, axis = 0)

id                                                  98994056770455
date             20141013T00000020141209T00000020150225T0000002...
price                                                  1.16729e+10
bedrooms                                                     72854
bathrooms                                                  45706.2
sqft_living                                               44952873
sqft_lot                                                 326506890
floors                                                     32296.5
waterfront                                                     163
view                                                          5064
condition                                                    73688
grade                                                       165488
sqft_above                                                38652488
sqft_basement                                              6300385
yr_built                                                  4259

Here `.apply()` returns a Series with one entry per column, labeled by column name, so you can look up the result for a particular column just as you would in any other Series:

In [13]:
data.apply(np.sum, axis = 0)['waterfront']

163
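
A small sketch of the two `axis` settings on a made-up numeric frame, to make the direction concrete (`toy`, `col_sums`, and `row_sums` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_sums = toy.apply(np.sum, axis=0)  # one value per column, labeled by column name
row_sums = toy.apply(np.sum, axis=1)  # one value per row, labeled by row index

print(type(col_sums).__name__)  # the result is a Series
print(col_sums["b"])            # look up one column's total by label
print(row_sums)
```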

`.apply` also works with functions you write yourself!

In [14]:
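def count_missing_vals(x):
    # x is one column (a Series); isnull() flags missing entries, sum() counts them
    return x.isnull().sum()

data.apply(count_missing_vals, axis=0)

Since `.info()` reported 21613 non-null values in every column, this returns 0 for each column.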
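
A minimal sketch of the same idea on a small made-up frame that contains some `NaN` gaps, so the counts are visible. The frame `toy` and its values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def count_missing_vals(column):
    # isnull() marks missing entries as True; summing counts them
    return column.isnull().sum()

toy = pd.DataFrame({"price": [221900.0, np.nan, 180000.0],
                    "yr_renovated": [np.nan, np.nan, 1991.0]})

missing = toy.apply(count_missing_vals, axis=0)
print(missing)
```

For this particular task, `toy.isnull().sum()` gives the same counts without a custom function; `apply` earns its keep when the per-column logic is something Pandas doesn't already provide.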

You can also sort a DataFrame by the values in any column. For example, you can sort by the year a home was renovated, in descending order, by setting `ascending=False`. Sorting a DataFrame returns a new DataFrame, so you can select columns from it as usual:

In [None]:
sorted_reno_date = data.sort_values(['yr_renovated'], ascending=False)
sorted_reno_date[['price','yr_built','yr_renovated']]

| | price | yr_built | yr_renovated |
|---|---|---|---|
| 8692 | 1485000.0 | 1964 | 2015 |
| 18575 | 476000.0 | 1945 | 2015 |
| 4240 | 815000.0 | 1962 | 2015 |
| 7958 | 203000.0 | 1952 | 2015 |
| 2295 | 585000.0 | 1922 | 2015 |
| 19444 | 872500.0 | 1956 | 2015 |
| 16683 | 420000.0 | 1961 | 2015 |
| 13216 | 579000.0 | 1962 | 2015 |
| 3156 | 830000.0 | 1968 | 2015 |
| 7097 | 285000.0 | 1940 | 2015 |
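
Many homes share the same renovation year, so a second sort key can be useful as a tie-breaker. A sketch on a made-up three-row frame (the names `toy`, `ordered`, and `top` and all values are invented):

```python
import pandas as pd

toy = pd.DataFrame({"price": [815000.0, 203000.0, 585000.0],
                    "yr_built": [1962, 1952, 1922],
                    "yr_renovated": [2015, 0, 2015]})

# Sort by renovation year (newest first), breaking ties by price (cheapest first).
ordered = toy.sort_values(["yr_renovated", "price"], ascending=[False, True])

# sort_values returns a new DataFrame; the original index labels travel with the rows.
top = ordered[["price", "yr_renovated"]].head(2)

print(ordered)
print(top)
```

The original `toy` frame is left unchanged, which is why the housing example above stores the sorted result in a new variable.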


![SegmentLocal](https://media1.giphy.com/media/FSzLVme5Y3n3LMOiqP/giphy.gif?cid=790b7611b95a0c5e3af81fa92912fe9aefc81203805f9762&rid=giphy.gif&ct=g=g "segment")