# Getting started with Pandas
+ Pandas is a major tool of interest in data analysis
+ Pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python
+ It is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, data visualisation libraries like matplotlib.
+ While Pandas adopts many coding idioms for Numpy, the biggest difference is that pandas is designed for working with tanular or heterogeneous data
+ Numpy, by contrast is best suited for working with homogeneous numerical array data
+ Pandas has two workhorses data structures
    1. ***Series***
       - a 1D array-like object containing a sequence of values
       - a series is displayed interactively showing the index on the left and the values on the right
       - you can create a series with an index identifying each data point with a label
            ```
            obj2=pd.Series([4,7,-5,3],index=['d','b','a','c'])
            obj2.index

            # Output
            Index(['d', 'b', 'a', 'c'], dtype='object')
            ```
        - here ['d', 'b', 'a', 'c'] is interpreted as a list of indices even though it contains strings instead of ints
        - when you have data contained in a dict, you can create a series from it by passing a dict
            ```
            sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
            obj3=pd.Series(sdata)

            # Output
            Ohio      35000
            Texas     71000
            Oregon    16000
            Utah       5000
            dtype: int64
            ```
        - when you only pass a dict, the index in the resulting series will have the dict's keys in sorted order, you can override this by passing the dict keys in the order you want them to appear
            ```
            states=['California','Ohio','Oregon','Texas']
            obj4=pd.Series(sdata,index=states)
            ```
    2. ***DataFrame***
        - a DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type(numeric, string, boolean, etc)
        - the DataFrame has both a row and column index
        - it can be thought of as a dict of Series all sharing the same index
        - under the hood, the data is stored as one or more 2D blocks rather than a dict, list or some other collection of 1D arrays

In [2]:
import pandas as pd

obj=pd.Series([4,7,-5,3])
# obj
obj.values
# obj.index

#? creating a series with an index identifying each data point with a label
obj2=pd.Series([4,7,-5,3],index=['d','b','a','c'])
obj2
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [22]:

#? creating a series from a dictionary

sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [21]:

#? creating a series from a dictionary with an index
states=['California','Ohio','Oregon','Texas']
obj4=pd.Series(sdata,index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [23]:

#? detecting missing data using isnull()
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [24]:

#? detecting missing data using notnull()
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [25]:

#? detecting missing data using instance method
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [26]:
# adding a name to the series object using the name attribute
obj4.name='population'

In [27]:
# adding a name to the index using the index.name attribute
obj4.index.name='state'

In [28]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [29]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [30]:
# altering the index of a series in-place by assignment
obj.index=['Bob','Steve','Jeff','Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

#### constructing a DataFrame
1. from a dict of equal-length lists
2. from NumPy array
+ the `head()` method selects only the first five rows
+ you can specify the sequence of columns and the DataFrame's columns will be arranged in that order

In [3]:
# creating a dataframe from a dictionary
establishment={'establishment_number':['M7335','M22001+P22001','I34311','M108+P51+V108','M5687+P5687'],'company':['Ara Food Corp.','Arapahoe Foods Inc.','Paden Cold Inc,','Bay Valley Foods','Bay View Packing Co.'], 'city':['Miami','Lafayette','Norfolk','Pittsburgh','Milwaukee'], 'state':['FL','LA','VA','PA','WI'],'zip':['33101','70506','23505','15212','53209'],'activity':['Meat Processing','Meat Processing','Meat Processing, Poultry Processing','Imported Food','Imported Food']}
df=pd.DataFrame(establishment)
df.head()

Unnamed: 0,establishment_number,company,city,state,zip,activity
0,M7335,Ara Food Corp.,Miami,FL,33101,Meat Processing
1,M22001+P22001,Arapahoe Foods Inc.,Lafayette,LA,70506,Meat Processing
2,I34311,"Paden Cold Inc,",Norfolk,VA,23505,"Meat Processing, Poultry Processing"
3,M108+P51+V108,Bay Valley Foods,Pittsburgh,PA,15212,Imported Food
4,M5687+P5687,Bay View Packing Co.,Milwaukee,WI,53209,Imported Food


In [34]:
# specifying a sequence of columns
df_one=pd.DataFrame(establishment, columns=['state','company','activity','city','zip'])
df_one.head()

Unnamed: 0,state,company,activity,city,zip
0,FL,Ara Food Corp.,Meat Processing,Miami,33101
1,LA,Arapahoe Foods Inc.,Meat Processing,Lafayette,70506
2,VA,"Paden Cold Inc,","Meat Processing, Poultry Processing",Norfolk,23505
3,PA,Bay Valley Foods,Imported Food,Pittsburgh,15212
4,WI,Bay View Packing Co.,Imported Food,Milwaukee,53209


In [5]:
# specifying an index passing a column that doesn't exist in the dict appears with missing values
df_two=pd.DataFrame(establishment, columns=['establishment_number','state','company','activity','city','zip','inspection_deadline'], index=['C1','C2','C3','C4','C5'])
df_two.head()

Unnamed: 0,establishment_number,state,company,activity,city,zip,inspection_deadline
C1,M7335,FL,Ara Food Corp.,Meat Processing,Miami,33101,
C2,M22001+P22001,LA,Arapahoe Foods Inc.,Meat Processing,Lafayette,70506,
C3,I34311,VA,"Paden Cold Inc,","Meat Processing, Poultry Processing",Norfolk,23505,
C4,M108+P51+V108,PA,Bay Valley Foods,Imported Food,Pittsburgh,15212,
C5,M5687+P5687,WI,Bay View Packing Co.,Imported Food,Milwaukee,53209,


In [6]:
# retrieving a column as a series by dict-like notation
df_two['company'].head()

C1          Ara Food Corp.
C2     Arapahoe Foods Inc.
C3         Paden Cold Inc,
C4        Bay Valley Foods
C5    Bay View Packing Co.
Name: company, dtype: object

In [7]:
df_two[['company']].head()

Unnamed: 0,company
C1,Ara Food Corp.
C2,Arapahoe Foods Inc.
C3,"Paden Cold Inc,"
C4,Bay Valley Foods
C5,Bay View Packing Co.


In [8]:
# retrieving a column by attribute
df_two.company

C1          Ara Food Corp.
C2     Arapahoe Foods Inc.
C3         Paden Cold Inc,
C4        Bay Valley Foods
C5    Bay View Packing Co.
Name: company, dtype: object

In [9]:
# retrieving a row by position or name using iloc and loc attribute
df_two.loc['C1']

establishment_number              M7335
state                                FL
company                  Ara Food Corp.
activity                Meat Processing
city                              Miami
zip                               33101
inspection_deadline                 NaN
Name: C1, dtype: object

In [11]:
# accessing a row using the iloc attribute
df_two.iloc[4]

establishment_number             M5687+P5687
state                                     WI
company                 Bay View Packing Co.
activity                       Imported Food
city                               Milwaukee
zip                                    53209
inspection_deadline               2024-12-31
Name: C5, dtype: object

In [10]:
# modifying columns in a dataframe by assignment
df_two['inspection_deadline']='2024-12-31'
df_two.head()

Unnamed: 0,establishment_number,state,company,activity,city,zip,inspection_deadline
C1,M7335,FL,Ara Food Corp.,Meat Processing,Miami,33101,2024-12-31
C2,M22001+P22001,LA,Arapahoe Foods Inc.,Meat Processing,Lafayette,70506,2024-12-31
C3,I34311,VA,"Paden Cold Inc,","Meat Processing, Poultry Processing",Norfolk,23505,2024-12-31
C4,M108+P51+V108,PA,Bay Valley Foods,Imported Food,Pittsburgh,15212,2024-12-31
C5,M5687+P5687,WI,Bay View Packing Co.,Imported Food,Milwaukee,53209,2024-12-31


In [62]:
# modifying a specific row in a dataframe using integer indexes
df_two.iloc[1, 5]='2023-12-31'
# df_two.head()
df_two.iloc[1, 5]

'2023-12-31'

In [12]:
# modifying a specific row with a custom index/label
df_two.loc['C3', 'inspection_deadline']='2022-12-31'
#df_two.head()
df_two.loc['C3', 'inspection_deadline']

'2022-12-31'

In [13]:
# or
df_two.at['C4','inspection_deadline']='2025-12-31'
#df_two.head()
df_two.loc['C4','inspection_deadline']

'2025-12-31'

In [14]:
# checking for data types using the dtypes attribute
df_two.dtypes

establishment_number    object
state                   object
company                 object
activity                object
city                    object
zip                     object
inspection_deadline     object
dtype: object

In [15]:
# type conversion
df_two['inspection_deadline']=pd.to_datetime(df_two['inspection_deadline'])
df_two.dtypes

establishment_number            object
state                           object
company                         object
activity                        object
city                            object
zip                             object
inspection_deadline     datetime64[ns]
dtype: object

In [16]:
# create a new column of boolean vals where city is equal to 'Miami'
df_two['city_in'] = df_two.city == 'Miami'
df_two.head()

Unnamed: 0,establishment_number,state,company,activity,city,zip,inspection_deadline,city_in
C1,M7335,FL,Ara Food Corp.,Meat Processing,Miami,33101,2024-12-31,True
C2,M22001+P22001,LA,Arapahoe Foods Inc.,Meat Processing,Lafayette,70506,2024-12-31,False
C3,I34311,VA,"Paden Cold Inc,","Meat Processing, Poultry Processing",Norfolk,23505,2022-12-31,False
C4,M108+P51+V108,PA,Bay Valley Foods,Imported Food,Pittsburgh,15212,2025-12-31,False
C5,M5687+P5687,WI,Bay View Packing Co.,Imported Food,Milwaukee,53209,2024-12-31,False


In [17]:
# deleting a column
del df_two['city_in']
df_two.head()

Unnamed: 0,establishment_number,state,company,activity,city,zip,inspection_deadline
C1,M7335,FL,Ara Food Corp.,Meat Processing,Miami,33101,2024-12-31
C2,M22001+P22001,LA,Arapahoe Foods Inc.,Meat Processing,Lafayette,70506,2024-12-31
C3,I34311,VA,"Paden Cold Inc,","Meat Processing, Poultry Processing",Norfolk,23505,2022-12-31
C4,M108+P51+V108,PA,Bay Valley Foods,Imported Food,Pittsburgh,15212,2025-12-31
C5,M5687+P5687,WI,Bay View Packing Co.,Imported Food,Milwaukee,53209,2024-12-31


#### constructing a DF with a nested dict of dicts
+ if a nested dict is passed to the DataFrame, ***Pandas*** will interpret the outer dict keys as the columns and the inner keys as the row indices
+ you can transpose the DF(swap rows and columns) with a similar syntax to a ***NumPy*** array
    ```
    df_three.T
    ```

In [71]:
pop={'Nevada':{2001:2.4, 2002:2.9}, 'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
df_three=pd.DataFrame(pop)
df_three.head()

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [72]:
# transposing a dataframe
df_three.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


In [73]:
# dicts of series
pdata={'Ohio':df_three['Ohio'][:-1], 'Nevada':df_three['Nevada'][:2]}
df_four=pd.DataFrame(pdata)
df_four.head()

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9
