Day 3 (9 10 2019): Data Manipulation with Pandas
Jarlin Almanzar edited this page Sep 10, 2019
Only two types you will ever need
- DataFrame: a 2D data structure that holds data identified by rows and columns.
- Series: a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)
- Column names: Must be unique
- Left: The index values are all unique and numeric, acting as a row number
- Right: The index values are named and non-unique
import pandas as pd
import numpy as np
tmp = pd.DataFrame()
tmp
import pandas as pd
tmp = pd.Series([2,3,3,4])
tmp
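The two kinds of index described above (unique and numeric versus named and non-unique) can be sketched as follows. The labels and column names here are made up for illustration and are not taken from the original notes.

```python
import pandas as pd

# Default index: unique integers 0..n-1 that act as row numbers.
numbered = pd.Series([2, 3, 3, 4])

# Named, non-unique index (the labels are hypothetical).
named = pd.Series([2, 3, 3, 4], index=["a", "b", "b", "c"])

# A DataFrame built from a dict: each key becomes a uniquely named column.
frame = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

print(numbered.index.is_unique)
print(named.index.is_unique)
print(named["b"])   # a repeated label returns every matching row
print(frame.shape)
```

Selecting a repeated label such as "b" returns a Series rather than a single value, which is one reason a unique index is usually easier to work with.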
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.__version__
%matplotlib inline
df=pd.read_csv("filename.csv")
df.head()
- Combine two DataFrames
first = pd.DataFrame(np.random.randn(5,4))
second = pd.DataFrame(np.random.randn(5,4))
pd.concat([first, second])
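As a rough sketch of how combining works, `pd.concat` takes a list of DataFrames and stacks them along an axis. The variable names `stacked` and `side_by_side` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame(np.random.randn(5, 4))
second = pd.DataFrame(np.random.randn(5, 4))

# Stack rows; ignore_index=True renumbers the result 0..9
# instead of repeating 0..4 twice.
stacked = pd.concat([first, second], ignore_index=True)

# Place the frames next to each other instead (rows are aligned on the index).
side_by_side = pd.concat([first, second], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```

Without `ignore_index=True`, the stacked result keeps the original row labels, so the index would no longer be unique.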
- Merge two DataFrames by a key
left = pd.DataFrame({'netId': ['adg111', 'adg2', 'adg3'], 'midterm': [67, 89, 90]})
right = pd.DataFrame({'netId': ['adg111', 'adg2', 'adg3'], 'midterm': [67, 89, 90]})
pd.merge(left,right, on="netId")
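When both frames share a non-key column name, as with `midterm` above, pandas keeps both copies and adds the suffixes `_x` and `_y`. The sketch below uses a hypothetical `final` column and a hypothetical unmatched ID (`adg4`) to show how the `how` argument changes which rows survive the merge.

```python
import pandas as pd

left = pd.DataFrame({"netId": ["adg111", "adg2", "adg3"], "midterm": [67, 89, 90]})
# Hypothetical second table: different column, and one netId that is not in `left`.
right = pd.DataFrame({"netId": ["adg111", "adg2", "adg4"], "final": [70, 85, 95]})

# Default how="inner": keep only the keys present in both frames.
inner = pd.merge(left, right, on="netId")

# how="outer": keep every key; missing values are filled with NaN.
outer = pd.merge(left, right, on="netId", how="outer")

print(inner)
print(outer)
```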
- Column selection: extract a single column (a Series) or several columns (a DataFrame). This is known as "indexing by columns".
Select a column by name with df["column_name"]
df1 = df['midterm']
df2 = df[['midterm', 'Finals']]
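A minimal, self-contained sketch of column selection, assuming a small made-up grade table (the values are hypothetical). A single name in brackets returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "netId": ["adg111", "adg2", "adg3"],
    "midterm": [67, 89, 90],
    "Finals": [70, 85, 95],
})

one_column = df["midterm"]               # a Series
two_columns = df[["midterm", "Finals"]]  # a DataFrame (note the inner list)

print(type(one_column).__name__)
print(type(two_columns).__name__)
```

Passing two names without the inner list, as in `df["midterm", "Finals"]`, is treated as a single tuple key and raises a `KeyError` on a frame like this one.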
- Large files: under 100 MB is typically fine.
- Very large files: multiple gigabytes can be a problem. You can use different tools such as Apache Spark, or read the data in "chunks", one at a time.
- How pandas manages memory: it represents numerical values as NumPy ndarrays.
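Reading in chunks, as mentioned in the list above, can be sketched with the `chunksize` argument of `pd.read_csv`. The in-memory CSV text below stands in for a large file on disk, and the column names and chunk size are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for a large file: 1000 rows of made-up data.
csv_text = "netId,midterm\n" + "\n".join(f"id{i},{i % 100}" for i in range(1000))

total = 0
rows = 0
# With chunksize set, read_csv yields DataFrames of at most 250 rows each,
# so only one chunk needs to be in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    total += chunk["midterm"].sum()
    rows += len(chunk)

mean_midterm = total / rows
print(mean_midterm)

# Numeric columns are backed by NumPy arrays; memory_usage reports bytes per column.
df = pd.read_csv(io.StringIO(csv_text))
values = df["midterm"].to_numpy()
print(type(values).__name__, df["midterm"].dtype)
print(df.memory_usage(deep=True))
```

Accumulating a running sum and row count per chunk gives the same mean as loading the whole file, without holding every row at once.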