
# Data frames

## DataFrames in Pandas (Python Data Analysis Library)
*  A Pandas DataFrame is a two-dimensional, labeled data structure
*  Different columns can hold different data types
*  Think of a DataFrame as a table in which each column holds values of one data type
*  DataFrames make it relatively convenient to manipulate data and extract information from it


### Creating DataFrames
**Dictionaries in Python**: We will build DataFrames from Python dictionaries. Treat each column name as a *key* and the list of items under that column as its *value*.

In [1]:
import pandas as pd

In [3]:
df = pd.DataFrame(data = {'First Name': ['James', 'Alexander', 'Ashley'],
                    'Last Name':['Cook', 'Bell', 'Mason'],
                    'Age': [38, 42, 23]})

In [4]:
df

Unnamed: 0,First Name,Last Name,Age
0,James,Cook,38
1,Alexander,Bell,42
2,Ashley,Mason,23


In [5]:
my_dict = {'First Name': ['James', 'Alexander', 'Ashley'],
           'Last Name':['Cook', 'Bell', 'Mason'],
           'Age': [38, 42, 23]}

In [95]:
df = pd.DataFrame(my_dict)
df

Unnamed: 0,First Name,Last Name,Age
0,James,Cook,38
1,Alexander,Bell,42
2,Ashley,Mason,23


Specify row indices externally, with the `index` argument

In [90]:
df = pd.DataFrame(my_dict, index=[1, 2, 3]) #numeric row indexing
df

Unnamed: 0,First Name,Last Name,Age
1,James,Cook,38
2,Alexander,Bell,42
3,Ashley,Mason,23


In [8]:
type(df)

pandas.core.frame.DataFrame

In [None]:
help(df)

Specify row indices internally, as the keys of nested dictionaries

In [None]:
my_dict = {'First Name': {1: 'James', 2: 'Alexander', 3: 'Ashley'},  # keys 1, 2, 3 become the row indices of the dataframe
           'Last Name': {1: 'Cook', 2: 'Bell', 3: 'Mason'},
           'Age': {1: 38, 2: 42, 3: 23}}
df = pd.DataFrame(my_dict)
df

Each column of a DataFrame has a single data type (dtype)

In [11]:
df['Age']

1    38
2    42
3    23
Name: Age, dtype: int64

In [12]:
df.Age

1    38
2    42
3    23
Name: Age, dtype: int64

In [13]:
df.Age.dtype

dtype('int64')

In [14]:
df.dtypes

First Name    object
Last Name     object
Age            int64
dtype: object
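
To see what a single dtype per column means in practice, here is a small sketch with made-up data (the `demo` frame and its `Mixed` column are hypothetical, not part of the examples above). When a column receives values of different Python types, pandas still assigns it one dtype, falling back to the generic `object` dtype.

```python
import pandas as pd

# Hypothetical data: 'Mixed' holds an int, a str, and a float.
demo = pd.DataFrame({'Age': [38, 42, 23],
                     'Mixed': [1, 'two', 3.0]})

age_dtype = demo['Age'].dtype      # all integers, so an integer dtype
mixed_dtype = demo['Mixed'].dtype  # no common numeric type, so object

print(demo.dtypes)
```

An `object` column can hold anything, but numeric operations on it are slower and may fail, so it is usually worth checking `dtypes` after building a frame.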

### List of Lists

In [15]:
myList = [['James','Cook',38],
          ['Alexander','Bell',42],
          ['Ashley','Mason',23]]
myList

[['James', 'Cook', 38], ['Alexander', 'Bell', 42], ['Ashley', 'Mason', 23]]

In [87]:
df2 = pd.DataFrame(myList, columns = ['First Name', 'Last Name', 'Age'])
df2

Unnamed: 0,First Name,Last Name,Age
0,James,Cook,38
1,Alexander,Bell,42
2,Ashley,Mason,23


In [None]:
df2.dtypes

Dataframe to dictionary

In [17]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]}

df = pd.DataFrame(my_dict, index = ['a','b','c'])

df

Unnamed: 0,First Name,Last Name,Age
a,James,Cook,38
b,Alexander,Bell,42
c,Ashley,Mason,23


In [18]:
df.to_dict()

{'First Name': {'a': 'James', 'b': 'Alexander', 'c': 'Ashley'},
 'Last Name': {'a': 'Cook', 'b': 'Bell', 'c': 'Mason'},
 'Age': {'a': 38, 'b': 42, 'c': 23}}

Dataframe to list

In [19]:
my_list = df.values.tolist() # list of lists; we lose the column names and the row index
my_list

[['James', 'Cook', 38], ['Alexander', 'Bell', 42], ['Ashley', 'Mason', 23]]
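
If the column names need to survive the conversion, two possible approaches are sketched below. The variable names `records` and `header_and_rows` are illustrative only; `to_dict(orient='records')` is a standard pandas option that yields one dictionary per row.

```python
import pandas as pd

df = pd.DataFrame({'First Name': ['James', 'Alexander', 'Ashley'],
                   'Last Name': ['Cook', 'Bell', 'Mason'],
                   'Age': [38, 42, 23]},
                  index=['a', 'b', 'c'])

# One dict per row; column names are kept as keys, the row index is dropped.
records = df.to_dict(orient='records')

# Plain list of lists with the column names prepended as a header row.
header_and_rows = [df.columns.tolist()] + df.values.tolist()

print(records[0])
print(header_and_rows[0])
```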

### DataFrame Indices

In [96]:
df

Unnamed: 0,First Name,Last Name,Age
0,James,Cook,38
1,Alexander,Bell,42
2,Ashley,Mason,23


In [21]:
df = pd.DataFrame(
     my_dict,
     index=['First Row', 'Second Row', 'Third Row'])#string row indexing
df

Unnamed: 0,First Name,Last Name,Age
First Row,James,Cook,38
Second Row,Alexander,Bell,42
Third Row,Ashley,Mason,23


In [22]:
import numpy as np

In [100]:
np_arr = np.array([1,2,3]) #numpy arrays hold a homogeneous data type
df = pd.DataFrame(my_dict, index = np_arr) #set numpy array as index
df

Unnamed: 0,First Name,Last Name,Age
1,James,Cook,38
2,Alexander,Bell,42
3,Ashley,Mason,23


In [101]:
df.index = ['One' , 'Two' , 'Three'] #if df is already created
df

Unnamed: 0,First Name,Last Name,Age
One,James,Cook,38
Two,Alexander,Bell,42
Three,Ashley,Mason,23


In [26]:
myIndices = list(df.index) #assign dataframe indices to a variable as a list
myIndices

['One', 'Two', 'Three']

.set_index()

In [None]:
help(df.set_index)

In [102]:
df_new = df.set_index('Age') # returns a new dataframe that uses the "Age" column as its index
df.set_index('Age', inplace = True) # modifies the existing dataframe in place
df # updated

Age,First Name,Last Name
38,James,Cook
42,Alexander,Bell
23,Ashley,Mason


### DataFrame Columns

In [28]:
myColumns = list(df.columns) #assign dataframe columns to a variable as a list
myColumns

['First Name', 'Last Name']

In [92]:
df = pd.DataFrame(my_dict, index = [1, 2, 3])
df.columns = ['First_Name', 'Last_Name', 'Age']
df

Unnamed: 0,First_Name,Last_Name,Age
1,James,Cook,38
2,Alexander,Bell,42
3,Ashley,Mason,23


In [30]:
df_new = df[['First_Name']] #returns a new dataframe with a single column from df
df_new

Unnamed: 0,First_Name
1,James
2,Alexander
3,Ashley


In [31]:
df_new = df[['First_Name','Age']]
df_new

Unnamed: 0,First_Name,Age
1,James,38
2,Alexander,42
3,Ashley,23


In [33]:
type(df_new)

pandas.core.frame.DataFrame

In [103]:
df = df[['Age', 'First_Name', 'Last_Name']] #re-arranging the columns
df

Unnamed: 0,Age,First_Name,Last_Name
1,38,James,Cook
2,42,Alexander,Bell
3,23,Ashley,Mason

In [35]:
ps = df['First_Name']
ps

1        James
2    Alexander
3       Ashley
Name: First_Name, dtype: object

In [36]:
type(ps)

pandas.core.series.Series

Some Useful Column Functions

In [41]:
import pandas as pd
import numpy as np

In [42]:
my_dict = {'First Name': ['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook', 'Bell', 'Mason'],
           'Age' :[38, 42, 23]}
df = pd.DataFrame(my_dict) #loads all columns from the dictionary

In [32]:
df.Age.mean()

34.333333333333336

In [37]:
df['Age'].mean()

34.333333333333336

In [43]:
np.std(df['Age'])

8.178562764256865
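
The value above is the population standard deviation, because `np.std` divides by `n` (`ddof=0`) by default. The pandas method `Series.std` divides by `n - 1` (`ddof=1`) by default, so the two calls disagree unless `ddof` is set explicitly. A minimal sketch, using a standalone `ages` series (an illustrative name) with the same three values:

```python
import numpy as np
import pandas as pd

ages = pd.Series([38, 42, 23], name='Age')

pop_std = np.std(ages)            # NumPy default: ddof=0 (population)
sample_std = ages.std()           # pandas default: ddof=1 (sample)
pop_std_pandas = ages.std(ddof=0) # pandas, forced to match NumPy

print(pop_std, sample_std, pop_std_pandas)
```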

In [44]:
max(df['Age'])

42

In [45]:
min(df['Age'])

23

In [47]:
df['First Name'].unique()

array(['James', 'Alexander', 'Ashley'], dtype=object)

Loading Specific Columns

In [109]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]}
df = pd.DataFrame(my_dict) #loads all columns from the dictionary
df1 = pd.DataFrame(my_dict, columns = ['First Name', 'Last Name'])
df1 #only the listed columns are loaded (saves memory)


Unnamed: 0,First Name,Last Name
0,James,Cook
1,Alexander,Bell
2,Ashley,Mason


In [110]:
columns = ['First Name', 'Last Name']
df[columns]

Unnamed: 0,First Name,Last Name
0,James,Cook
1,Alexander,Bell
2,Ashley,Mason


Deleting Columns

In [111]:
del df1["First Name"] #removes a column from existing dataframe
df1

Unnamed: 0,Last Name
0,Cook
1,Bell
2,Mason


Drop function: deleting rows or columns

In [None]:
help(df.drop)

In [112]:
df

Unnamed: 0,First Name,Last Name,Age
0,James,Cook,38
1,Alexander,Bell,42
2,Ashley,Mason,23


In [113]:
new_df = df.drop(labels='Age', axis=1) #drops column “Age” and creates a new dataframe. Existing df stays intact.
new_df

Unnamed: 0,First Name,Last Name
0,James,Cook
1,Alexander,Bell
2,Ashley,Mason


In [114]:
df

Unnamed: 0,First Name,Last Name,Age
0,James,Cook,38
1,Alexander,Bell,42
2,Ashley,Mason,23


In [115]:
df.drop(labels ='Age', axis = 1, inplace = True) #drops column “Age” and updates existing df. default inplace=False
df

Unnamed: 0,First Name,Last Name
0,James,Cook
1,Alexander,Bell
2,Ashley,Mason


In [116]:
new_df = df.drop(0) #drops the row labeled 0 (here the first row) and returns a new dataframe.
new_df

Unnamed: 0,First Name,Last Name
1,Alexander,Bell
2,Ashley,Mason


In [117]:
df.drop(1, inplace = True) #deletes row labeled 1 from df.
df

Unnamed: 0,First Name,Last Name
0,James,Cook
2,Ashley,Mason


Dropping multiple columns and rows

In [118]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]}
df = pd.DataFrame(my_dict)
new_df = df.drop(labels = ['First Name','Age'], axis = 1) #drops multiple columns
new_df

Unnamed: 0,Last Name
0,Cook
1,Bell
2,Mason


In [119]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]}
df = pd.DataFrame(my_dict)
new_df = df.drop(labels = [1,2], axis = 0) #drops the rows labeled 1 and 2
new_df

Unnamed: 0,First Name,Last Name,Age
0,James,Cook,38


### Dimensions

In [121]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook','Bell','Mason'],
           'Age': [38, 42, 23]}

df = pd.DataFrame(my_dict, index=[1, 2, 3])

del df['Age'] #now a dataframe with 3 rows and 2 columns

In [123]:
dim = df.shape #shape is a (rows, columns) tuple
dim

(3, 2)

In [124]:
nrows = dim[0]
nrows

3

In [125]:
ncols = dim[1]
ncols

2

### Data Selection

In [126]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
            'Last Name': ['Cook','Bell','Mason'],
             'Age': [38, 42, 23]}
df = pd.DataFrame(my_dict, index=[1,2,3])

In [127]:
df[0:2] #select first two rows and all columns and create a new dataframe

Unnamed: 0,First Name,Last Name,Age
1,James,Cook,38
2,Alexander,Bell,42


In [128]:
df[0:2][['Age']] #select first two rows and column named “Age”, create a new data frame

Unnamed: 0,Age
1,38
2,42


In [129]:
df[1:2][['Age','Last Name']] #select second row and columns “Age” and “Last Name”

Unnamed: 0,Age,Last Name
2,42,Bell


**Note:** Slicing with `df[start:stop]` can only select a contiguous run of rows. To pick rows that are not next to each other, use `.loc` or `.iloc`.

**.loc** selects by label. Rows and columns are specified by the dataframe's row and column labels, not by their positions.

In [131]:
my_dict = {'First Name':['James', 'Alexander', 'Ashley'],
            'Last Name': ['Cook','Bell','Mason'],
             'Age': [38, 42, 23]}
df = pd.DataFrame(my_dict, index=[1,2,3])
df

Unnamed: 0,First Name,Last Name,Age
1,James,Cook,38
2,Alexander,Bell,42
3,Ashley,Mason,23


In [132]:
df.loc[[1,3]] #returns the rows labeled 1 and 3, with all columns

Unnamed: 0,First Name,Last Name,Age
1,James,Cook,38
3,Ashley,Mason,23


In [133]:
df.loc[[1,3]][['Age']] #returns two rows and columns named “Age”

Unnamed: 0,Age
1,38
3,23
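
The cell above chains two selections: `.loc` for the rows, then `[['Age']]` for the column. `.loc` also accepts row labels and column labels together in one call, which avoids the intermediate dataframe. A short sketch on the same data (the names `chained`, `single`, and `scalar` are illustrative):

```python
import pandas as pd

my_dict = {'First Name': ['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook', 'Bell', 'Mason'],
           'Age': [38, 42, 23]}
df = pd.DataFrame(my_dict, index=[1, 2, 3])

chained = df.loc[[1, 3]][['Age']]  # two steps: rows first, then columns
single = df.loc[[1, 3], ['Age']]   # one step: rows and columns together
scalar = df.loc[3, 'Age']          # a single label pair returns one value

print(single)
print(scalar)
```

The single-call form is also the one to prefer when assigning values, since assigning through a chained selection may modify a temporary copy instead of `df`.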


In [134]:
df.loc[1:3] #label-based slice: selects rows labeled 1 through 3, both endpoints included

Unnamed: 0,First Name,Last Name,Age
1,James,Cook,38
2,Alexander,Bell,42
3,Ashley,Mason,23


In [140]:
df.loc[1:] # select all rows starting from row label 1

Unnamed: 0,First Name,Last Name,Age
1,James,Cook,38
2,Alexander,Bell,42
3,Ashley,Mason,23


In [141]:
new_df = pd.DataFrame(my_dict, index = ['a','b','c'])
new_df

Unnamed: 0,First Name,Last Name,Age
a,James,Cook,38
b,Alexander,Bell,42
c,Ashley,Mason,23


In [142]:
new_df.loc[1:3] #raises a TypeError: integer slice bounds cannot be used with string row labels

TypeError: ignored

In [143]:
new_df.loc['a':'c'] #return rows from a to c.

Unnamed: 0,First Name,Last Name,Age
a,James,Cook,38
b,Alexander,Bell,42
c,Ashley,Mason,23


**.iloc** selects by integer position, regardless of the row and column labels

In [144]:
new_df.loc[1:3] #still a TypeError: .loc looks up labels, and this index has string labels

TypeError: ignored

In [145]:
new_df.iloc[1:3] #select from second row to third row

Unnamed: 0,First Name,Last Name,Age
b,Alexander,Bell,42
c,Ashley,Mason,23


In [None]:
new_df.iloc[[0,2]] #returns first row and third row

In [None]:
new_df.iloc[[0,2]] [['Age']] #returns first row and third row and column label “Age”

In [None]:
new_df.iloc[[0, 2], [2]] #returns first row and third row and third column

In [None]:
new_df.iloc[[0, 2], [0,2]] #returns more columns
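
The last few cells were not executed in this notebook, so here is a self-contained sketch of what they select. It also contrasts the two slicing rules: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, like ordinary Python slicing. The result names are illustrative.

```python
import pandas as pd

my_dict = {'First Name': ['James', 'Alexander', 'Ashley'],
           'Last Name': ['Cook', 'Bell', 'Mason'],
           'Age': [38, 42, 23]}
new_df = pd.DataFrame(my_dict, index=['a', 'b', 'c'])

rows = new_df.iloc[[0, 2]]           # first and third rows, all columns
block = new_df.iloc[[0, 2], [0, 2]]  # same rows, first and third columns

by_label = new_df.loc['a':'b']       # end label 'b' is included: 2 rows
by_position = new_df.iloc[0:1]       # end position 1 is excluded: 1 row

print(block)
print(len(by_label), len(by_position))
```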

### Online Data Sets

In [147]:
import seaborn as sns
iris = sns.load_dataset('iris')
type(iris)
iris.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [149]:
import pandas as pd
nba = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
type(nba)
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [151]:
nba1 = pd.read_csv("/content/nba.csv")
nba1.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [153]:
nba1.to_csv('nba(2).txt')
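
By default `to_csv` also writes the row index as an unnamed first column, and reading such a file back produces a column called `Unnamed: 0`. Passing `index=False` avoids that. The sketch below uses a tiny made-up frame and in-memory buffers (`io.StringIO`) instead of real files, so it runs without network or disk access.

```python
import io
import pandas as pd

# Hypothetical two-row sample, not the full NBA data set.
sample = pd.DataFrame({'Name': ['Avery Bradley', 'Jae Crowder'],
                       'Age': [25.0, 25.0]})

# Default: the index is written as an extra, unnamed column.
buf_with_index = io.StringIO()
sample.to_csv(buf_with_index)
buf_with_index.seek(0)
round_trip = pd.read_csv(buf_with_index)

# index=False: only the data columns are written.
buf_clean = io.StringIO()
sample.to_csv(buf_clean, index=False)
buf_clean.seek(0)
clean = pd.read_csv(buf_clean)

print(list(round_trip.columns))
print(list(clean.columns))
```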

In [148]:
import pandas as pd
mtcars = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
mtcars.rename(columns={'Unnamed: 0':'brand'}, inplace=True)
mtcars.head()

Unnamed: 0,brand,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


### Missing Values

In [154]:
import numpy as np

In [155]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 3, 5],
                   [np.nan, 3, np.nan, 4]],
                   columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,3.0,5
3,,3.0,,4


In [156]:
df.fillna(0)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,3.0,5
3,0.0,3.0,0.0,4


Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with the mean of each column.

In [157]:
values = {"A": np.mean(df.A), "B": np.mean(df.B), "C": np.mean(df.C), "D": np.mean(df.D)}
df.fillna(value = values)

Unnamed: 0,A,B,C,D
0,3.0,2.0,3.0,0
1,3.0,4.0,3.0,1
2,3.0,3.0,3.0,5
3,3.0,3.0,3.0,4


Only replace the first NaN element.
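
No cell follows this sentence in the section. One way to do it, assuming the `limit` parameter of `fillna` is what is intended here, is sketched below. With a dictionary of fill values, `limit=1` fills at most one NaN per column, working from the top. The names `first_only` and `remaining` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 3, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Column means, computed while ignoring NaN (pandas' default behaviour).
values = {col: df[col].mean() for col in df.columns}

# Fill at most one NaN per column; later NaNs stay in place.
first_only = df.fillna(value=values, limit=1)

# Number of NaNs still left in each column.
remaining = first_only.isna().sum()

print(first_only)
print(remaining)
```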