# Pandas for Data Science
## Loading, Manipulating, and Cleaning Files

## What is Pandas?
Pandas is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. — Wikipedia

As data scientists, we mostly see stunning visualizations and state-of-the-art machine learning algorithms, but the backbone of most data science projects is Pandas. Skill with Pandas will therefore help you in most, if not all, of your data science projects.

### How does pandas fit into the data science toolkit?
Not only is the pandas library a central component of the data science toolkit but it is used in conjunction with other libraries in that collection.

Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn. 

Source: [https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/]
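As a small illustration of the NumPy connection described above, the sketch below converts a data frame to a NumPy array and applies a NumPy function to a column. The frame df_demo and its values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not part of the original notebook
df_demo = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

# to_numpy() returns the values as a plain NumPy array
arr = df_demo.to_numpy()
print(type(arr), arr.shape)

# NumPy functions can be applied directly to a column (a pandas Series)
log_y = np.log10(df_demo["y"])
print(log_y)
```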

### Import Pandas

In [15]:
import numpy as np
import pandas as pd

### Creating Your Dataframe
There are multiple ways to create a data frame in Pandas.

#### From Scratch

In [None]:
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [22000,25000,27000,35000]
        }

df = pd.DataFrame(cars, columns = ['Brand', 'Price'])

df
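A dictionary of columns is only one possible starting point. As a sketch of another option, a data frame can also be built from a list of row dictionaries, where each dictionary is one record. The variable names records and df_records are chosen for this example only.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
records = [
    {"Brand": "Honda Civic", "Price": 22000},
    {"Brand": "Toyota Corolla", "Price": 25000},
]

df_records = pd.DataFrame(records)
print(df_records)
```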

#### From a File
Loading from a file is one of the most common ways to get data into Pandas for processing in Python. Pandas can read many file formats and data sources, such as:
1. Comma-separated values (CSV)
2. JSON
3. HTML
4. Excel
5. SQL
6. Pickle File

Please follow this
[link](https://realpython.com/pandas-read-write-files/#working-with-different-file-types) for more information on loading these file types.
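As a minimal sketch of the CSV case, the example below reads CSV text with pd.read_csv and writes it back with to_csv. To keep it self-contained, an in-memory string (csv_text, made-up data) stands in for a file on disk; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# A small in-memory CSV stands in for a file on disk (hypothetical data)
csv_text = "Brand,Price\nHonda Civic,22000\nToyota Corolla,25000\n"

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# Write back to CSV text; index=False leaves out the row labels
round_trip = df_csv.to_csv(index=False)
print(round_trip)
```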

#### From a URL
You can read tables directly from a URL using pd.read_html, which returns a list of data frames, one for each table found on the page. The number of tables will depend on the webpage you are trying to access.

In [12]:
# Webpage url                                                                                                               
url = 'https://en.wikipedia.org/wiki/History_of_Python'

# Extract tables
dfs = pd.read_html(url)

# Get first table                                                                                                           
df = dfs[0]

# Extract columns                                                                                                           
df2 = df[['Version','Release date']]
df2

Unnamed: 0,Version,Release date
0,0.9,1991-02-20[2]
1,1.0,1994-01-26[2]
2,1.1,1994-10-11[2]
3,1.2,1995-04-13[2]
4,1.3,1995-10-13[2]
5,1.4,1996-10-25[2]
6,1.5,1998-01-03[2]
7,1.6,2000-09-05[40]
8,2.0,2000-10-16[42]
9,2.1,2001-04-15[43]


In [13]:
# Webpage url                                                                                                               
url = 'https://usefulwall.com/2018/03/salary-grade-table-2018-philippine-salary-standardization-law/'

# Extract tables
dfs = pd.read_html(url)

# Get first table                                                                                                           
df = dfs[0]
df

Unnamed: 0,Salary Grade,Sample Positions,Step 1,Step 2,Step 3,Step 4,Step 5,Step 6,Step 7,Step 8
0,1,Utility Worker I,"₱11,068","₱11,160","₱11,254","₱11,348","₱11,443","₱11,538","₱11,635","₱11,732"
1,2,Messenger,"₱11,761","₱11,851","₱11,942","₱12,034","₱12,126","₱12,219","₱12,313","₱12,407"
2,3,Clerk I,"₱12,466","₱12,562","₱12,658","₱12,756","₱12,854","₱12,952","₱13,052","₱13,152"
3,4,Driver II,"₱13,214","₱13,316","₱13,418","₱13,521","₱13,625","₱13,729","₱13,835","₱13,941"
4,5,Carpenter II,"₱14,007","₱14,115","₱14,223","₱14,332","₱14,442","₱14,553","₱14,665","₱14,777"
5,6,Lab Technician I,"₱14,847","₱14,961","₱15,076","₱15,192","₱15,309","₱15,426","₱15,545","₱15,664"
6,7,Computer Operator I,"₱15,738","₱15,859","₱15,981","₱16,104","₱16,227","₱16,352","₱16,477","₱16,604"
7,8,Assistant Engineer,"₱16,758","₱16,910","₱17,063","₱17,217","₱17,372","₱17,529","₱17,688","₱17,848"
8,9,Electrician Foreman,"₱17,975","₱18,125","₱18,277","₱18,430","₱18,584","₱18,739","₱18,896","₱19,054"
9,10,Legal Assistant I,"₱19,233","₱19,394","₱19,556","₱19,720","₱19,884","₱20,051","₱20,218","₱20,387"


### Manipulating the Data Frame
Read the content of this [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) to know the basics of manipulating the dataframe.
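For a quick taste of what that guide covers, here is a short sketch of a few basic operations (selecting a column, filtering rows, adding a column, and sorting). It reuses the car data from earlier in this notebook; the names cars, expensive, Price_k, and by_price are chosen for this example.

```python
import pandas as pd

cars = pd.DataFrame({
    "Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4"],
    "Price": [22000, 25000, 27000, 35000],
})

# Select a single column (returns a Series)
prices = cars["Price"]

# Keep only the rows that satisfy a condition
expensive = cars[cars["Price"] > 25000]

# Add a derived column
cars["Price_k"] = cars["Price"] / 1000

# Sort by a column, highest price first
by_price = cars.sort_values("Price", ascending=False)
print(by_price)
```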

### Data Cleaning
As data scientists, we must know how to clean our data. Before doing anything else, such as exploratory data analysis (EDA), we need to make sure that we can trust our data.

#### Dealing with Missing Values
One common problem in real-world data is incompleteness. There are different approaches to dealing with it, and the right one depends mainly on the context and nature of the data, as well as the objective of the study.

Let us first create a simple data frame with missing values.

In [37]:
df = pd.DataFrame({"name": ['Clark', 'bruce', 'Diana','barry','Victor',np.nan, 'arthur'],
                    "age": [np.nan, 34, 33,24,31,np.nan,np.nan],
                    "born": [pd.NaT, pd.Timestamp("1986-04-25"),pd.Timestamp("1987-07-28"),pd.Timestamp("1997-07-28"),
                             pd.NaT,pd.NaT,pd.Timestamp("1990-04-25") ]})
df

Unnamed: 0,name,age,born
0,Clark,,NaT
1,bruce,34.0,1986-04-25
2,Diana,33.0,1987-07-28
3,barry,24.0,1997-07-28
4,Victor,31.0,NaT
5,,,NaT
6,arthur,,1990-04-25


#### Dropping of Missing Values

To drop all rows with at least one missing value, use dropna.

In [38]:
df.dropna()
df

Unnamed: 0,name,age,born
0,Clark,,NaT
1,bruce,34.0,1986-04-25
2,Diana,33.0,1987-07-28
3,barry,24.0,1997-07-28
4,Victor,31.0,NaT
5,,,NaT
6,arthur,,1990-04-25


As you can observe, df is unchanged: dropna returns a new data frame instead of modifying the original. There are two ways to keep the result.

First, you may assign the result to another variable. This retains the original dataframe. See the next cell.

In [39]:
df1 = df.dropna()
df1

Unnamed: 0,name,age,born
1,bruce,34.0,1986-04-25
2,Diana,33.0,1987-07-28
3,barry,24.0,1997-07-28


Here, you can see that the new dataframe, df1, reflected the dropping of the rows.

Also, the original dataframe, df, is still the same.

In [40]:
df

Unnamed: 0,name,age,born
0,Clark,,NaT
1,bruce,34.0,1986-04-25
2,Diana,33.0,1987-07-28
3,barry,24.0,1997-07-28
4,Victor,31.0,NaT
5,,,NaT
6,arthur,,1990-04-25


The second method is to use the argument inplace=True. This will modify the original data frame.


In [41]:
df.dropna(inplace=True)
df

Unnamed: 0,name,age,born
1,bruce,34.0,1986-04-25
2,Diana,33.0,1987-07-28
3,barry,24.0,1997-07-28


Often, dropping every row with at least one missing value discards too much data. Instead, you can check for missing values only in specific columns.

Let's first recreate our dataframe.

In [42]:
df = pd.DataFrame({"name": ['Clark', 'bruce', 'Diana','barry','Victor',np.nan, 'arthur'],
                    "age": [np.nan, 34, 33,24,31,np.nan,np.nan],
                    "born": [pd.NaT, pd.Timestamp("1986-04-25"),pd.Timestamp("1987-07-28"),pd.Timestamp("1997-07-28"),
                             pd.NaT,pd.NaT,pd.Timestamp("1990-04-25") ]})
df

Unnamed: 0,name,age,born
0,Clark,,NaT
1,bruce,34.0,1986-04-25
2,Diana,33.0,1987-07-28
3,barry,24.0,1997-07-28
4,Victor,31.0,NaT
5,,,NaT
6,arthur,,1990-04-25


The subset argument defines which column to look at. In this example it is looking at the age column.

In [46]:
df.dropna(subset=['age'])

Unnamed: 0,name,age,born
1,bruce,34.0,1986-04-25
2,Diana,33.0,1987-07-28
3,barry,24.0,1997-07-28
4,Victor,31.0,NaT


The how='all' argument will remove rows where ALL columns are missing.

In [48]:
df.dropna(how='all')

Unnamed: 0,name,age,born
0,Clark,,NaT
1,bruce,34.0,1986-04-25
2,Diana,33.0,1987-07-28
3,barry,24.0,1997-07-28
4,Victor,31.0,NaT
6,arthur,,1990-04-25


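You may also define a threshold. The thresh argument is the minimum number of non-missing values a row needs in order to be kept. With thresh=2 and three columns, rows with missing values in two or more columns are dropped. Recent pandas versions raise an error when how and thresh are set in the same call, so only thresh is passed here.

In [50]:
df.dropna(thresh=2)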

Unnamed: 0,name,age,born
1,bruce,34.0,1986-04-25
2,Diana,33.0,1987-07-28
3,barry,24.0,1997-07-28
4,Victor,31.0,NaT
6,arthur,,1990-04-25
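To make the meaning of thresh concrete, the sketch below counts the non-missing values in each row and keeps the rows with at least two of them, then checks that this matches dropna(thresh=2). The frame df_demo is a shortened, made-up variant of the data above.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["Clark", "bruce", np.nan, "arthur"],
    "age": [np.nan, 34, np.nan, np.nan],
    "born": [pd.NaT, pd.Timestamp("1986-04-25"), pd.NaT, pd.Timestamp("1990-04-25")],
})

# Number of non-missing values in each row
non_missing = df_demo.notna().sum(axis=1)

# Keep rows that have at least 2 non-missing values
kept_by_mask = df_demo[non_missing >= 2]

# dropna(thresh=2) applies the same rule
kept_by_thresh = df_demo.dropna(thresh=2)

print(kept_by_mask.equals(kept_by_thresh))
```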


It may also be useful to see information about the number of missing values.

In [54]:
df.isna()

Unnamed: 0,name,age,born
0,False,True,True
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,True
5,True,True,True
6,False,True,False


In [55]:
df.isna().sum()

name    1
age     3
born    3
dtype: int64
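Raw counts are easier to judge next to the size of the data. As a sketch, taking the mean of the isna() result gives the share of missing values per column, and any(axis=1) picks out the rows that have at least one gap. The frame df_demo is made up for this example.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["Clark", "bruce", np.nan, "arthur"],
    "age": [np.nan, 34, np.nan, np.nan],
})

# True counts as 1 and False as 0, so the mean is the fraction missing
missing_share = df_demo.isna().mean()
print(missing_share)

# Rows that contain at least one missing value
rows_with_gaps = df_demo[df_demo.isna().any(axis=1)]
print(rows_with_gaps)
```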

#### Imputing Values

Load the dataset temp_data.csv. In this dataset, missing readings are recorded as 0 rather than left blank.

In [56]:
df_temp = pd.read_csv('temp_data.csv')
df_temp

Using SimpleImputer from sklearn, we can impute the missing values.

In [80]:
from sklearn.impute import SimpleImputer
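# strategy can be 'mean', 'median', 'most_frequent', or 'constant'
# missing_values=0 tells the imputer that 0 marks a missing reading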
imr = SimpleImputer(missing_values=0, strategy='mean')
imr = imr.fit(df_temp.values)
imputed_data = imr.transform(df_temp.values)
imputed_data

array([[ 1.        , 29.11165058, 16.37888044],
       [ 2.        , 31.54673092, 49.33976706],
       [ 3.        , 32.92208599, 59.7666668 ],
       [ 4.        , 33.62165324, 44.80376982],
       [ 5.        , 26.25555815, 48.8457166 ],
       [ 6.        , 29.39663351, 10.44781229],
       [ 7.        , 30.37686836, 82.99743032],
       [ 8.        , 33.56181579,  9.7791989 ],
       [ 9.        , 27.30232699, 43.29480771],
       [10.        , 31.89551822, 67.21842947],
       [11.        , 33.99075254, 48.8457166 ],
       [12.        , 28.19147808, 63.69334514],
       [13.        , 30.37686836, 71.18550378],
       [14.        , 27.45714049, 12.33424667],
       [15.        , 28.25756504, 11.26109318],
       [16.        , 32.53312056, 48.8457166 ],
       [17.        , 28.62488488, 43.6082156 ],
       [18.        , 30.37686836, 91.75609278],
       [19.        , 26.87048435, 68.12807287],
       [20.        , 33.99823221, 95.73815775],
       [21.        , 33.08297756, 48.845

In [81]:
df_temp_imp = pd.DataFrame(imputed_data)

In [82]:
df_temp_imp

Unnamed: 0,0,1,2
0,1.0,29.111651,16.37888
1,2.0,31.546731,49.339767
2,3.0,32.922086,59.766667
3,4.0,33.621653,44.80377
4,5.0,26.255558,48.845717
5,6.0,29.396634,10.447812
6,7.0,30.376868,82.99743
7,8.0,33.561816,9.779199
8,9.0,27.302327,43.294808
9,10.0,31.895518,67.218429


In [83]:
df_temp

Unnamed: 0,Time,Temperature,Relative Humidity
0,1,29.111651,16.37888
1,2,31.546731,49.339767
2,3,32.922086,59.766667
3,4,33.621653,44.80377
4,5,26.255558,0.0
5,6,29.396634,10.447812
6,7,0.0,82.99743
7,8,33.561816,9.779199
8,9,27.302327,43.294808
9,10,31.895518,67.218429
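The same mean imputation can also be sketched in pandas alone, which has the side benefit of keeping the column names. The example below uses a small made-up frame (readings) in place of temp_data.csv and assumes, as above, that 0 marks a missing measurement. It first turns the zeros into NaN and then fills them with fillna.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; 0 marks a missing measurement
readings = pd.DataFrame({
    "Time": [1, 2, 3, 4],
    "Temperature": [30.0, 0.0, 32.0, 34.0],
    "Relative Humidity": [40.0, 50.0, 0.0, 60.0],
})

measure_cols = ["Temperature", "Relative Humidity"]

# Turn the 0 placeholders into NaN so pandas treats them as missing
cleaned = readings.copy()
cleaned[measure_cols] = cleaned[measure_cols].replace(0, np.nan)

# Fill each column's gaps with that column's mean (NaN is skipped by mean())
cleaned[measure_cols] = cleaned[measure_cols].fillna(cleaned[measure_cols].mean())
print(cleaned)
```

Working on a copy leaves the original readings untouched, in the same spirit as assigning the result of dropna to a new variable.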


The 'most_frequent' strategy may come in handy for non-numeric values.

In [86]:
df = pd.DataFrame([["ANNA", "x"],
                   [np.nan, "y"],
                   ["ANNA", np.nan],
                   ["KARA", "y"]], dtype="category")
df

Unnamed: 0,0,1
0,ANNA,x
1,,y
2,ANNA,
3,KARA,y


In [90]:
imr = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# fit the imputer and fill the gaps in one step
imputed_data = imr.fit_transform(df.values)
df_imp = pd.DataFrame(imputed_data)
df_imp

Unnamed: 0,0,1
0,ANNA,x
1,ANNA,y
2,ANNA,y
3,KARA,y


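#### Labels and One Hot Encoding

Many machine learning algorithms expect numeric input, so categorical columns need to be encoded first. Label encoding assigns each category an integer. One-hot encoding creates one 0/1 column per category, which avoids implying an order among the categories.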

In [91]:
from sklearn.preprocessing import LabelEncoder
# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
bridge_df['Bridge_Types_Cat'] = labelencoder.fit_transform(bridge_df['Bridge_Types'])
bridge_df

Unnamed: 0,Bridge_Types,Bridge_Types_Cat
0,Arch,0
1,Beam,1
2,Truss,6
3,Cantilever,3
4,Tied Arch,5
5,Suspension,4
6,Cable,2


In [92]:
from sklearn.preprocessing import OneHotEncoder
# creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')
# passing bridge-types-cat column (label encoded values of bridge_types)
enc_df = pd.DataFrame(enc.fit_transform(bridge_df[['Bridge_Types_Cat']]).toarray())
# join the encoded columns to bridge_df on the row index
bridge_df = bridge_df.join(enc_df)
bridge_df

Unnamed: 0,Bridge_Types,Bridge_Types_Cat,0,1,2,3,4,5,6
0,Arch,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Beam,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,Truss,6,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,Cantilever,3,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,Tied Arch,5,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,Suspension,4,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6,Cable,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [93]:
# creating initial dataframe
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])
# generate binary values using get_dummies
dum_df = pd.get_dummies(bridge_df, columns=["Bridge_Types"], prefix=["Type_is"] )
# join the dummy columns to bridge_df on the row index
bridge_df = bridge_df.join(dum_df)
bridge_df

Unnamed: 0,Bridge_Types,Type_is_Arch,Type_is_Beam,Type_is_Cable,Type_is_Cantilever,Type_is_Suspension,Type_is_Tied Arch,Type_is_Truss
0,Arch,1,0,0,0,0,0,0
1,Beam,0,1,0,0,0,0,0
2,Truss,0,0,0,0,0,0,1
3,Cantilever,0,0,0,1,0,0,0
4,Tied Arch,0,0,0,0,0,1,0
5,Suspension,0,0,0,0,1,0,0
6,Cable,0,0,1,0,0,0,0
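Depending on the pandas version, get_dummies may return True/False columns instead of the 1/0 values shown above. As a sketch, passing dtype=int requests integer columns explicitly. The frame bridges is a shortened, made-up version of the bridge data.

```python
import pandas as pd

bridges = pd.DataFrame({"Bridge_Types": ["Arch", "Beam", "Truss"]})

# dtype=int asks for 0/1 integers regardless of the version's default
dummies = pd.get_dummies(bridges["Bridge_Types"], prefix="Type_is", dtype=int)

# Place the dummy columns next to the original column
encoded = pd.concat([bridges, dummies], axis=1)
print(encoded)
```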


#### Dropping Duplicates

In [103]:
import pandas as pd
df = pd.DataFrame([
    ['Monkey', 'Small', 2, '0'],
    ['Monkey', 'Medium', 4, '0'],
    ['Lion', 'Large', 2, '1'],
    ['Tiger', 'Large', 3, '1'],
    ['Tiger', 'Small', 1, '1'],
    ['Tiger', 'Small', 1, '1']])
df.columns = ['Animal', 'Size', 'Age', 'Target']
df

Unnamed: 0,Animal,Size,Age,Target
0,Monkey,Small,2,0
1,Monkey,Medium,4,0
2,Lion,Large,2,1
3,Tiger,Large,3,1
4,Tiger,Small,1,1
5,Tiger,Small,1,1


In [104]:
# keep one row per distinct Size value (the first occurrence by default)
df.drop_duplicates(subset=['Size'])

Unnamed: 0,Animal,Size,Age,Target
0,Monkey,Small,2,0
1,Monkey,Medium,4,0
2,Lion,Large,2,1


In [105]:
# keep the first occurrence of each Size value
df.drop_duplicates(subset=['Size'],keep='first')

Unnamed: 0,Animal,Size,Age,Target
0,Monkey,Small,2,0
1,Monkey,Medium,4,0
2,Lion,Large,2,1


In [106]:
# keep the last occurrence of each Size value
df.drop_duplicates(subset=['Size'],keep='last')

Unnamed: 0,Animal,Size,Age,Target
1,Monkey,Medium,4,0
3,Tiger,Large,3,1
5,Tiger,Small,1,1


In [107]:
# keep=False drops every row that has an exact duplicate, keeping none of them
df.drop_duplicates(keep=False)

Unnamed: 0,Animal,Size,Age,Target
0,Monkey,Small,2,0
1,Monkey,Medium,4,0
2,Lion,Large,2,1
3,Tiger,Large,3,1


In [108]:
df

Unnamed: 0,Animal,Size,Age,Target
0,Monkey,Small,2,0
1,Monkey,Medium,4,0
2,Lion,Large,2,1
3,Tiger,Large,3,1
4,Tiger,Small,1,1
5,Tiger,Small,1,1
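Before dropping anything, it can help to see which rows pandas considers duplicates. As a sketch, duplicated() returns a boolean mask that marks repeated rows, using the same keep rules as drop_duplicates. The frame animals is a shortened, made-up version of the data above.

```python
import pandas as pd

animals = pd.DataFrame(
    [["Monkey", "Small", 2], ["Tiger", "Small", 1], ["Tiger", "Small", 1]],
    columns=["Animal", "Size", "Age"],
)

# True marks a row that repeats an earlier row (keep='first' is the default)
dup_mask = animals.duplicated()
n_dups = int(dup_mask.sum())
print(n_dups)

# Drop the repeats and renumber the rows from 0
deduped = animals.drop_duplicates().reset_index(drop=True)
print(deduped)
```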


In this activity on data processing in Python with Pandas, I came across several fundamental concepts that have proved crucial to my understanding of data analysis. My key takeaways are discussed below. The first thing I learned was how to load data into Pandas and process it. I discovered that Pandas provides a variety of functions for loading data from different file formats such as CSV, JSON, HTML, and Excel. Loading data from a file with Pandas is easy and efficient, and it can be done in just a few lines of code. Moreover, I learned that we can get data directly from a URL using Pandas, which is convenient when we need data from the web. This feature gives us access to a wide range of data sets that are available online for use in our analysis.

Apart from loading data, I also learned how to deal with missing values in a data frame. Missing data is a common problem in real-world data sets, and it is essential to handle it correctly to avoid errors or biases in the analysis. I learned that we can use the dropna function in Pandas to drop rows that have missing values, or the fillna function to fill the missing values with other values. Furthermore, I learned that we can impute the missing values using SimpleImputer from sklearn. SimpleImputer is a powerful tool that fills in missing values using strategies such as the mean, median, mode, or a user-defined value. This technique is helpful when we have a large amount of missing data and want to impute it efficiently. Lastly, I discovered that we can drop duplicates to clean the data set and remove redundant information. The drop_duplicates function in Pandas drops rows that have the same values in all columns or in a subset of columns. This is handy when a data set has duplicate entries that we want to eliminate to avoid confusion in our analysis.

In summary, my experience with this activity has been a fantastic learning journey, and I have learned many valuable skills that I can apply in my future projects and activities. From loading data into Pandas and processing it to handling missing data and dropping duplicates, I now have a strong foundation in data processing with Pandas.