
# **Introduction**

Pandas is a Python library for analyzing tabular data.

In [4]:
# Before using Pandas we import the pandas library

import pandas as pd

# **Understanding Data Frames**

### Creating Data Frames
A DataFrame is a two-dimensional labeled data structure made up of rows and columns.

In [6]:
# Method 1: Creating a DataFrame from dictionaries

# A list of dictionaries: each dictionary is one row, and its keys are the column names
my_dictionary = [{'name': 'Anne', 'age': 34}, {'name': 'David', 'age': 25}]

# creating my_dataframe from the list of dictionaries defined above
my_dataframe = pd.DataFrame(my_dictionary)

# printing out the dataframe
my_dataframe

Unnamed: 0,name,age
0,Anne,34
1,David,25


OBSERVATION:
* The result is a new DataFrame built from the dictionary values.
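
As a small sketch of the two common dictionary layouts, the example below builds the same table column by column and row by row. The names `by_column` and `by_row` are made up for illustration.

```python
import pandas as pd

# Column-oriented: one dictionary whose keys are the column names
by_column = pd.DataFrame({'name': ['Anne', 'David'], 'age': [34, 25]})

# Row-oriented: a list with one dictionary per row
by_row = pd.DataFrame([{'name': 'Anne', 'age': 34},
                       {'name': 'David', 'age': 25}])

# Both layouts describe the same table
print(by_column.equals(by_row))
```

The row-oriented layout is convenient when records arrive one at a time, while the column-oriented layout is shorter when whole columns are already available.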


---


<font color = 'pink'>NOTE:
Every time you create a DataFrame without specifying an index, pandas automatically assigns integer row labels starting at 0.</font>
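
To make the note concrete, here is a minimal sketch comparing the automatic index with one supplied by hand. The data and the labels `'r1'` and `'r2'` are assumptions chosen only for illustration.

```python
import pandas as pd

# Illustrative data (not taken from the cells above)
people = {'name': ['Anne', 'David'], 'age': [34, 25]}

# No index given: pandas assigns the default integer labels 0, 1, ...
default_df = pd.DataFrame(people)
print(default_df.index)

# Index given explicitly: the rows use these labels instead
labelled_df = pd.DataFrame(people, index=['r1', 'r2'])
print(labelled_df.index.tolist())
```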




In [2]:
# Method 2: Creating a DataFrame by inserting rows iteratively

# For this example we will use the randint() function,
# so we need to import it
from random import randint

# declare the columns that we will need 
columns = ['a', 'b', 'c']

# creating dataframe
my_dataframe = pd.DataFrame(columns=columns)

# Lastly, fill the DataFrame with random values using two nested for loops.
# The outer loop iterates over the row labels: range(7) yields 0-6, so we get 7 rows.
# The inner loop iterates over the three column names.
# For each (row, column) pair, randint(-1, 1) returns a random integer from -1 to 1, inclusive.
# .loc[] selects by label: my_dataframe.loc[i, c] is the cell at row label i and column c.
# Assigning to a row label that does not exist yet adds that row, which is how the
# empty DataFrame created above grows; cells not yet assigned in a new row hold NaN.
# The logic can be a little confusing at first, so spend some time with your pair
# working through it, as it will help you a lot in the future.
for i in range(7):  # 7 rows, labelled 0-6
    for c in columns:
        my_dataframe.loc[i, c] = randint(-1, 1)
  
# printing out the dataframe
my_dataframe


Unnamed: 0,a,b,c
0,1,-1,-1
1,0,0,1
2,0,1,-1
3,1,-1,0
4,-1,-1,-1
5,-1,1,-1
6,1,-1,1


OBSERVATION:
* The result is a new DataFrame filled with random values from randint().
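
Growing a DataFrame one cell at a time is easy to follow but gets slow for large tables. A different approach, sketched here as an alternative rather than as the method used above, collects the rows in a plain list first and builds the DataFrame once. The names `rows` and `rows_df` are made up for illustration.

```python
from random import randint

import pandas as pd

columns = ['a', 'b', 'c']

# Build each row as a dictionary mapping column name -> random integer in [-1, 1]
rows = [{c: randint(-1, 1) for c in columns} for _ in range(7)]

# Create the DataFrame in a single step from the collected rows
rows_df = pd.DataFrame(rows, columns=columns)
print(rows_df.shape)
```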






In [3]:
# Method 3: Creating a Dataframe with randomly generated data

# We will import and use numpy in this case
import numpy as np

# A 5x4 array of random integers from 0 to 4 (the upper bound 5 is excluded)
np_mat = np.random.randint(0,5,size=(5, 4))

# creating the dataframe, naming the columns A, B, C and D
df = pd.DataFrame(np_mat, columns=list('ABCD'))

# printing out the dataframe
df

Unnamed: 0,A,B,C,D
0,0,3,1,1
1,1,3,2,0
2,4,3,0,4
3,4,3,2,2
4,1,3,3,1


OBSERVATION:
* The result is a new DataFrame holding the random data generated by np.random.randint().
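
Because the data is random, the output changes on every run. If repeatable results are wanted, one option is a seeded NumPy generator, sketched below. The seed value 42 and the name `seeded_df` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily; the same seed reproduces the same numbers
rng = np.random.default_rng(seed=42)

# 5 rows x 4 columns of integers from 0 up to (not including) 5
matrix = rng.integers(0, 5, size=(5, 4))

seeded_df = pd.DataFrame(matrix, columns=list('ABCD'))
print(seeded_df)
```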






In [7]:
# Method 4: Creating a DataFrame from a file or url
# Pandas can import data from various file formats such as comma-separated values (CSV), JSON, Parquet, SQL database tables or queries, and Microsoft Excel.
# For CSV files we use read_csv, for JSON read_json, and for Excel read_excel.

# reading from a file path
df = pd.read_csv('sample_data/california_housing_test.csv', delimiter = ',') # for csv
# df = pd.read_json(filepath, orient='columns') # for json
# df = pd.read_excel(filepath, sheet_name=0, header=1) # for excel file

# reading from a url (this reassigns df, so the output below shows the countries data)
df = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv") # for csv
# df = pd.read_json(url, orient='columns') # for json
# df = pd.read_excel(url, sheet_name=0, header=1) # for excel file

# printing out the dataframe
df

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA
...,...,...
189,Paraguay,SOUTH AMERICA
190,Peru,SOUTH AMERICA
191,Suriname,SOUTH AMERICA
192,Uruguay,SOUTH AMERICA


OBSERVATION:
* The result is a new DataFrame with the values read from the given file or URL.
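
The cell above needs a local file and network access. To try read_csv without either, the sketch below feeds it CSV text held in memory. The two rows in `csv_text` are made-up sample data standing in for a real file.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk or a URL
csv_text = "Country,Region\nAlgeria,AFRICA\nPeru,SOUTH AMERICA\n"

# read_csv accepts a path, a URL, or any file-like object such as StringIO
countries = pd.read_csv(io.StringIO(csv_text))
print(countries)
```

The first line of the text is used as the column names, and the rows get the default integer index.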






### Changing Dataframe column names





In [9]:
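# Editing column names by position, as shown below

# Creating DataFrame
df_list = [['AA', 1, 'a'],['BB', 2, 'a'],['CC', 3, 'a']]
df = pd.DataFrame(df_list, columns = ['name','value','salue'])

# Keep the first column name and add a prefix to the rest using a list comprehension.
# We assign a new list to df.columns rather than modifying df.columns.values in place,
# since an Index is meant to be immutable.
df.columns = [df.columns[0]] + ['prefix_' + val for val in df.columns[1:]]
df.columns.values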

# Printing the dataframe
df

Unnamed: 0,name,prefix_value,prefix_salue
0,AA,1,a
1,BB,2,a
2,CC,3,a


OBSERVATION:
* Every column name after the first now starts with prefix_.


In [19]:
# Changing Dataframe column names by listing the column names in order

# Creating DataFrame
df_list = [['AA', "temp", 1],['BB', "temp", 2],['CC', "temp", 3]]
df = pd.DataFrame(df_list, columns = ['name','temp', 'value'])

# Listing the column names in order 
df.columns = ['names', 'temperature', 'values']

# Printing the dataframe
df

Unnamed: 0,names,temperature,values
0,AA,temp,1
1,BB,temp,2
2,CC,temp,3


OBSERVATION:
* The column names are replaced, in order, by the names in the new list.
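
Listing every name in order works, but it requires retyping names that do not change. When only some columns need new names, `DataFrame.rename` with a mapping is an alternative, sketched below. The names `df_rename` and `renamed` are made up for illustration.

```python
import pandas as pd

df_rename = pd.DataFrame([['AA', 'temp', 1], ['BB', 'temp', 2]],
                         columns=['name', 'temp', 'value'])

# Map old names to new names; columns not mentioned keep their name
renamed = df_rename.rename(columns={'temp': 'temperature'})

print(list(renamed.columns))
print(list(df_rename.columns))  # the original DataFrame is left unchanged
```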


### Choosing specific columns from a DataFrame


This is done by listing the columns to choose as shown below:

In [18]:
# Creating DataFrame
df_list = [['AA', "temp", 1],['BB', "temp", 2],['CC', "temp", 3]]
df = pd.DataFrame(df_list, columns = ['name','temp', 'value'])

# Listing the columns to choose
df = df[["name","temp"]]

# Printing the dataframe
df

Unnamed: 0,name,temp
0,AA,temp
1,BB,temp
2,CC,temp
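
One detail worth seeing in code is the difference between single and double square brackets when choosing columns. The sketch below uses made-up names (`df_select`, `one_column`, `as_frame`, `subset`) for illustration.

```python
import pandas as pd

df_select = pd.DataFrame([['AA', 'temp', 1], ['BB', 'temp', 2]],
                         columns=['name', 'temp', 'value'])

# A single column name returns a Series
one_column = df_select['name']

# A list of column names returns a DataFrame, even if the list has one name
as_frame = df_select[['name']]

# .loc with ':' for the rows selects the same columns by label
subset = df_select.loc[:, ['name', 'value']]

print(type(one_column).__name__, type(as_frame).__name__)
```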


### Deleting/dropping columns or extracting columns from Dataframe 


*   drop()
*   pop()



---



<font color = 'pink'>**NOTE**:
drop() returns a DataFrame with the column removed (or None when inplace=True, in which case the original is modified), while pop() removes the column from the DataFrame and returns it.</font>


In [16]:
# Deleting/dropping columns or extracting columns from a DataFrame

df_list = [['AA', "temp", 1],['BB', "temp", 2],['CC', "temp", 3]]
df = pd.DataFrame(df_list, columns = ['name','value','temp'])
df

# Delete a column (the 'value' column in this case)
df.drop('value', axis=1, inplace=True)

# Printing the dataframe
df

Unnamed: 0,name,temp
0,AA,1
1,BB,2
2,CC,3


OBSERVATION:
* Because inplace=True is set, df.drop() removes the column from df itself; without it, drop() returns a new DataFrame and leaves df unchanged.


In [17]:
# Creating DataFrame
df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])
df

# Remove the 'value' column from df and keep it in a variable
dropped_values = df.pop('value')

# Show the dropped values (the result of df.pop('value'))
dropped_values

0    1
1    2
2    3
Name: value, dtype: int64

OBSERVATION:
* df.pop() returns the removed column as a Series; the column no longer exists in df.
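
To compare the two methods side by side, here is a minimal sketch on one small DataFrame. The names `df_demo`, `without_value`, and `popped` are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame([['AA', 1], ['BB', 2]], columns=['name', 'value'])

# drop() without inplace returns a new DataFrame; df_demo keeps both columns
without_value = df_demo.drop(columns='value')
print(list(without_value.columns), list(df_demo.columns))

# pop() removes the column from df_demo itself and returns it as a Series
popped = df_demo.pop('value')
print(list(df_demo.columns), popped.tolist())
```

Use drop() when the original table should stay intact, and pop() when the removed column is needed afterwards.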





