## Libraries and Pandas



__Our goals today are to be able to:__

- Identify and import Python modules and packages (libraries)
- Investigate table data in Pandas
- Manipulate Pandas DataFrames and Series

## Libraries (Packages)

### Terminology

![mod2](img/modules2.png)



### Terminology

![packages3](img/packages3.png)

### pip & the Python Package Index

[Python Package Index](https://pypi.org/)

<img src="img/pypi_packages.png" width=600>

__You can also write your own modules__

Any .py file can be imported as a module, and a folder of modules can be imported as a package.
![pipmod](img/import_modules.png)

![pippack](img/package_redo.png)
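As a minimal sketch of writing your own module: the example below creates a small module file in a temporary folder and imports it. The module name `my_tools` and the function `double` are hypothetical, chosen only for illustration. In practice you would usually save the .py file next to your notebook instead of generating it from code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source code of a hypothetical module; any .py file works the same way.
module_source = '''
def double(x):
    """Return twice the input."""
    return 2 * x
'''

# Write the module into a temporary folder so the sketch is self-contained.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "my_tools.py").write_text(module_source)

# Make the folder visible to the import system, then import as usual.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import my_tools

result = my_tools.double(21)
print(result)
```

Python looks for modules in the folders listed in `sys.path`, which is why a module saved in the current working directory can normally be imported without any extra setup.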

## Pandas

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=600>  

__Why not spreadsheets?__

[5 and Half Reasons to Ditch the Spreadsheet](https://lucidmanager.org/spreadsheets-for-data-science/)

### Installing and Using Pandas

In [1]:
import pandas 

In [2]:
pandas.__version__

'1.0.3'

In [6]:
## Why pandas? Inspect the module's docstring
pandas?

Type:        module
String form: <module 'pandas' from 'C:\\Users\\Raven (Work)\\.conda\\envs\\learn-env\\lib\\site-packages\\pandas\\__init__.py'>
File:        c:\users\raven (work)\.conda\envs\learn-env\lib\site-packages\pandas\__init__.py
Docstring:  
pandas - a powerful data analysis and manipulation library for Python

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating 

In [4]:
## By convention, pandas is imported under the alias pd

import pandas as pd

### Main Data Structures in Pandas: Series, DataFrame and Index

#### Series

In [7]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [14]:
type(data)

pandas.core.series.Series

In [15]:
data[2]

0.75

In [9]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [19]:
val = data.values
type(val)


numpy.ndarray

In [21]:
ind=data.index
type(ind)

pandas.core.indexes.range.RangeIndex

In [22]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [23]:
data[3]

1.0

In [24]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [25]:
data['b']

0.5

In [27]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

[For more on Series](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.01-Introducing-Pandas-Objects.ipynb)
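The cells above show that a Series can be built from a list or from a dictionary. As a rough sketch of what that means in practice, a Series supports both dictionary-style access by label and array-style operations on all values at once. The variable names below are illustrative, and the data is a subset of the population figures used above.

```python
import pandas as pd

population = pd.Series({
    "California": 38332521,
    "Texas": 26448193,
    "New York": 19651127,
})

# Dictionary-like behaviour: look up by label, test membership in the index.
texas = population["Texas"]
has_florida = "Florida" in population

# Array-like behaviour: boolean masks and element-wise arithmetic.
large = population[population > 20_000_000]
in_millions = population / 1e6

print(texas)
print(has_florida)
print(large.index.to_list())
```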

#### DataFrame

The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [41]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

# Build a DataFrame from two Series that share the same index
states = pd.DataFrame({'population': population,
                       'area': area})
states

|  | population | area |
| --- | --- | --- |
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |


[Difference between Dataframe and Series](https://stackoverflow.com/questions/26047209/what-is-the-difference-between-a-pandas-series-and-a-single-column-dataframe)
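To make the Series versus DataFrame distinction concrete, here is a small sketch using two of the states from above. It assumes nothing beyond pandas itself; the variable names and the `density` column are illustrative additions, not part of the original notebook.

```python
import pandas as pd

population = pd.Series({"California": 38332521, "Texas": 26448193})
area = pd.Series({"California": 423967, "Texas": 695662})
states = pd.DataFrame({"population": population, "area": area})

# Single brackets return one column as a Series.
one_column_series = states["area"]

# Double brackets (a list of column names) return a DataFrame.
one_column_frame = states[["area"]]

# Columns share the row index, so arithmetic lines up by label.
states["density"] = states["population"] / states["area"]

print(type(one_column_series).__name__)
print(type(one_column_frame).__name__)
print(states.columns.to_list())
```

A one-column DataFrame still has two dimensions (rows and columns), while a Series has only one, which is why the two selections above have different types.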

### Importing and Reading Data with Pandas

In [32]:
## Let's check the current directory first
%pwd

'C:\\Users\\Raven (Work)\\Desktop\\Flatiron\\dc-ds-060120\\mod-1\\day-3\\second_session'

In [42]:
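## Let's see the files in the current directory
## (on Windows, %ls runs `dir`, which does not understand the Unix-style -la flags,
## hence the "File Not Found" output)
%ls -la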

 Volume in drive C is Acer
 Volume Serial Number is 9A2D-3AEE

 Directory of C:\Users\Raven (Work)\Desktop\Flatiron\dc-ds-060120\mod-1\day-3\second_session



File Not Found


In [46]:
import pandas as pd

muj_df = pd.read_csv('data/made_up_jobs.csv')


Pandas can read many file types besides CSV. Examples of other reader functions include read_excel and read_html.
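The reader functions share a common pattern: give them a source and get a DataFrame back. The sketch below uses `read_csv` with an in-memory string in place of a file on disk, so it runs without any data folder. The CSV text is made up for illustration and is not the contents of made_up_jobs.csv.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = """ID,Name,Years Employed
0,Bob Bobberty,13
1,Susan Smells,5
"""

# read_csv accepts a file path or any file-like object.
jobs = pd.read_csv(io.StringIO(csv_text))

# index_col chooses a column to use as the row labels.
jobs_by_id = pd.read_csv(io.StringIO(csv_text), index_col="ID")

print(jobs.shape)
print(jobs_by_id.index.to_list())
```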

In [47]:
type(pd)

module

In [55]:
## Let's take a look at the attributes of the pandas module
dir(pd)

['BooleanDtype',
 'Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NA',
 'NaT',
 'NamedAgg',
 'Panel',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseArray',
 'SparseDataFrame',
 'SparseDtype',
 'SparseSeries',
 'StringDtype',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__Datetime',
 '__DatetimeSub',
 '__SparseArray',
 '__SparseArraySub',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__numpy',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_is_numpy_d

__some methods that will be useful__

- head, tail

- describe

- info

- loc vs iloc?

- values

- renaming columns

- dropping columns
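Renaming and dropping columns are listed above. A minimal sketch of both follows, using a small made-up DataFrame; the new column name `years_employed` is an arbitrary choice for illustration.

```python
import pandas as pd

jobs = pd.DataFrame({
    "ID": [0, 1, 2],
    "Name": ["Bob Bobberty", "Susan Smells", "Alex Lastname"],
    "Years Employed": [13, 5, 2],
})

# rename takes a mapping from old column names to new ones.
renamed = jobs.rename(columns={"Years Employed": "years_employed"})

# drop with columns=... removes the listed columns.
trimmed = renamed.drop(columns=["ID"])

print(trimmed.columns.to_list())
print(jobs.columns.to_list())
```

Both methods return a new DataFrame by default and leave the original untouched, so assign the result to a name if you want to keep it.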



In [50]:
muj_df

|  | ID | Name | Job | Years Employed |
| --- | --- | --- | --- | --- |
| 0 | 0 | Bob Bobberty | Underwater Basket Weaver | 13 |
| 1 | 1 | Susan Smells | Salad Spinner | 5 |
| 2 | 2 | Alex Lastname | Productivity Manager | 2 |
| 3 | 3 | Rudy P. | Being cool | 55 |
| 4 | 4 | Rudy G. | Being compared to Rudy P | 50 |
| 5 | 5 | Sir Wellington | Cheese Stacker | 10 |


In [54]:
# shape
## returns (rows, columns) as a tuple

muj_df.shape

(6, 4)

In [59]:
# head
muj_df.head(3)

|  | ID | Name | Job | Years Employed |
| --- | --- | --- | --- | --- |
| 0 | 0 | Bob Bobberty | Underwater Basket Weaver | 13 |
| 1 | 1 | Susan Smells | Salad Spinner | 5 |
| 2 | 2 | Alex Lastname | Productivity Manager | 2 |


In [60]:
#tail

muj_df.tail(3)

|  | ID | Name | Job | Years Employed |
| --- | --- | --- | --- | --- |
| 3 | 3 | Rudy P. | Being cool | 55 |
| 4 | 4 | Rudy G. | Being compared to Rudy P | 50 |
| 5 | 5 | Sir Wellington | Cheese Stacker | 10 |


In [76]:
n = muj_df.Name

In [78]:
type(n)

pandas.core.series.Series

In [79]:
type(muj_df['Name'])

pandas.core.series.Series

In [80]:
type(muj_df[['Name']])

pandas.core.frame.DataFrame

In [81]:
muj_df['Name']

0      Bob Bobberty
1      Susan Smells
2     Alex Lastname
3           Rudy P.
4           Rudy G.
5    Sir Wellington
Name: Name, dtype: object

In [93]:
muj_df[['Name','Job']]

|  | Name | Job |
| --- | --- | --- |
| 0 | Bob Bobberty | Underwater Basket Weaver |
| 1 | Susan Smells | Salad Spinner |
| 2 | Alex Lastname | Productivity Manager |
| 3 | Rudy P. | Being cool |
| 4 | Rudy G. | Being compared to Rudy P |
| 5 | Sir Wellington | Cheese Stacker |


In [94]:
## Attribute names cannot start with a digit, so this is a SyntaxError
muj_df.2_list

SyntaxError: invalid token (<ipython-input-94-c43efe85639d>, line 1)

In [97]:
muj_df.columns.to_list()

['ID', 'Name', 'Job', 'Years Employed']

In [100]:
muj_df.describe()

|  | ID | Years Employed |
| --- | --- | --- |
| count | 6.0 | 6.0 |
| mean | 2.5 | 22.5 |
| std | 1.870829 | 23.602966 |
| min | 0.0 | 2.0 |
| 25% | 1.25 | 6.25 |
| 50% | 2.5 | 11.5 |
| 75% | 3.75 | 40.75 |
| max | 5.0 | 55.0 |


In [101]:
muj_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   ID              6 non-null      int64 
 1   Name            6 non-null      object
 2   Job             6 non-null      object
 3   Years Employed  6 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 320.0+ bytes


In [102]:
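## Without parentheses, this returns the bound method instead of calling it;
## write muj_df.Name.describe() to get the summary
muj_df.Name.describe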

<bound method NDFrame.describe of 0      Bob Bobberty
1      Susan Smells
2     Alex Lastname
3           Rudy P.
4           Rudy G.
5    Sir Wellington
Name: Name, dtype: object>

In [105]:
muj_df

|  | ID | Name | Job | Years Employed |
| --- | --- | --- | --- | --- |
| 0 | 0 | Bob Bobberty | Underwater Basket Weaver | 13 |
| 1 | 1 | Susan Smells | Salad Spinner | 5 |
| 2 | 2 | Alex Lastname | Productivity Manager | 2 |
| 3 | 3 | Rudy P. | Being cool | 55 |
| 4 | 4 | Rudy G. | Being compared to Rudy P | 50 |
| 5 | 5 | Sir Wellington | Cheese Stacker | 10 |


In [107]:
muj_df.iloc[2:5]

|  | ID | Name | Job | Years Employed |
| --- | --- | --- | --- | --- |
| 2 | 2 | Alex Lastname | Productivity Manager | 2 |
| 3 | 3 | Rudy P. | Being cool | 55 |
| 4 | 4 | Rudy G. | Being compared to Rudy P | 50 |


In [109]:
muj_df.loc[2:4]

|  | ID | Name | Job | Years Employed |
| --- | --- | --- | --- | --- |
| 2 | 2 | Alex Lastname | Productivity Manager | 2 |
| 3 | 3 | Rudy P. | Being cool | 55 |
| 4 | 4 | Rudy G. | Being compared to Rudy P | 50 |


In [111]:
muj_df.index.to_list()

[0, 1, 2, 3, 4, 5]

In [114]:
muj_df = muj_df.set_index(['ID'])

In [115]:
muj_df

| ID | Name | Job | Years Employed |
| --- | --- | --- | --- |
| 0 | Bob Bobberty | Underwater Basket Weaver | 13 |
| 1 | Susan Smells | Salad Spinner | 5 |
| 2 | Alex Lastname | Productivity Manager | 2 |
| 3 | Rudy P. | Being cool | 55 |
| 4 | Rudy G. | Being compared to Rudy P | 50 |
| 5 | Sir Wellington | Cheese Stacker | 10 |
