# Manipulating Data With Pandas
<img src="https://github.com/FSCJ-FacultyDev/SWC-Virtual-2024/blob/main/notebooks.day4/images/SWC22-Pandas.PythonPandasLogo.jpg?raw=true" width=250 height=150/>

Pandas is a data analysis library built on top of NumPy.
- The name is derived from the term "panel data," which refers to multi-dimensional structured data sets typically used in statistics and econometrics.
- It also plays on the phrase "Python Data Analysis Library," which reflects the library's primary purpose of providing data manipulation and analysis tools.
- Pandas provides data structures and operations for manipulating data, most notably the DataFrame.
- DataFrames are two-dimensional, table-like arrays with attached row and column labels.
- DataFrames can include heterogeneous types and/or missing data.
- Pandas also provides functions for handling data in a similar fashion to database frameworks and spreadsheet programs.
- It relies heavily on NumPy for its core data structures and operations; while you do not need to explicitly import NumPy to use it, NumPy must be installed in your environment for pandas to function correctly.

In [None]:
# use NumPy and Pandas
# import numpy as np # not explicitly required
import pandas as pd
print("Pandas version is", pd.__version__)

# The Series Object
- A Pandas **Series** is a one-dimensional array of indexed data. It can be created from a list or array.
    - A Series wraps both a sequence of values and a sequence of indices, which can be accessed with the **values** and **index** attributes.
    - The values are simply a familiar NumPy array.


In [None]:
atad = pd.Series([0.52, 0.8, 0.63, 4.0], index = ['a', 'b', 'c', 'd'])
print(atad)
print()
print(atad.values)

# The Series Index
- The Series index is an array-like object of type pd.Index.
    - As with a NumPy array, data can be accessed by the associated index using square-bracket notation.
    - The Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.


In [None]:
data = pd.Series([0.25, 0.5, 0.79, 1.0])
print(data)

#index of the data series
print(data.index)

#the element at index 1
print(data[1])

#a slice of a series (start:stop)
data[1:3]

# Python Dictionaries and Pandas Series

- Series are used for handling and manipulating one-dimensional labeled data.
  - A Series is similar to a specialized Python dictionary.
  - A dictionary maps arbitrary keys to a set of arbitrary values; a Series maps typed keys to a set of typed values.
  - This type information makes a Pandas Series much more efficient than a Python dictionary for certain operations.
  - A Series object can be constructed directly from a Python dictionary:

In [None]:
#create a dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
print(population_dict)
print()
#create a Pandas series from the dictionary
population = pd.Series(population_dict)
print(population)
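
The efficiency point above comes from the Series storing its values in one typed array. The sketch below is only an illustration with two of the states: it converts the counts to millions once with a dictionary comprehension and once with a single Series expression. The names `millions_dict` and `millions` are introduced here for the example.

```python
import pandas as pd

population_dict = {'California': 38332521, 'Texas': 26448193}

# dictionary: each value is transformed one at a time in a Python loop
millions_dict = {state: count / 1e6 for state, count in population_dict.items()}

# Series: one expression is applied to every value at once
population = pd.Series(population_dict)
millions = population / 1e6

print(millions_dict)
print(millions)
print(population.dtype)   # the single type shared by all values
```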

### Dictionary-style item access can be used with a Series:



In [None]:
print(population['California'])

- Unlike a dictionary, the Series also supports NumPy array-style operations such as slicing:

In [None]:
print(population['California':'New York'])

# Creating a Series
- A Series can be created by instantiating the object:
            pd.Series(data, index)
- **data** can be one of many types of entities (e.g., list, dictionary, NumPy array, or a single scalar).
- **index** specifies the labels for the Series.
  - if not provided, pandas will default to a RangeIndex starting from 0.
  - it can be any array-like structure (e.g., list, array, or pandas Index object); for list or array data it must have the same length as data, a scalar is repeated to fill the index, and a dictionary is restricted to the keys listed in the index.

In [None]:
# simple list-based series
print(pd.Series([2, 4, 6]))

print("\nfilled series")
# scalar data: fill with 5's and specify the index
print(pd.Series(5, index=[100, 200, 300]))

print("\nsimple dictionary-based series")
# dictionary keys become the index
print(pd.Series({2:'a', 1:'b', 3:'c'}))

print("\npopulated using only specified keys")
# populate using only specified keys
print(pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2]))

## Try It!

Create a pandas Series that represents the scores of three students in a test.

**Sample Output**

```
Alice      85
Bob        90
Charlie    78
dtype: int64
```


# DataFrames
- The DataFrame can also be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
- A DataFrame is comparable to a two-dimensional array with both flexible row indices and flexible column names.
- Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects.
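
To make "aligned" concrete, here is a minimal sketch using made-up scores (the names `math`, `science`, and `scores` are not part of the lesson data). The two Series share only one label; pandas lines the rows up by label and marks the positions with no matching value as NaN, which is covered under Missing Values.

```python
import pandas as pd

# two Series whose indices only partly overlap (illustrative values)
math = pd.Series({'Alice': 85, 'Bob': 90})
science = pd.Series({'Bob': 88, 'Charlie': 95})

# each Series becomes a column; rows are matched by index label
scores = pd.DataFrame({'math': math, 'science': science})
print(scores)
```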



### Construct a new area Series which parallels the population Series created earlier, then create a two-dimensional DataFrame using those objects

In [None]:
#recall the population_dict from above
print("Population Dictionary:")
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
print(population_dict)


#the Pandas Series from that dictionary
print("\nPopulation Series:")
population = pd.Series(population_dict)
print(population)

#create a new area dictionary for the same states (in sq km)
print("\nArea Dictionary:")
area_dict = {'California': 423967,
             'Texas': 695662,
             'New York': 141297,
             'Florida': 170312,
             'Illinois': 149995}
print(area_dict)


#create a Pandas Series from the area dictionary
print("\nArea Series:")
area = pd.Series(area_dict)
print(area)

print("\nDataFrame:")
#create a DataFrame from the two Series
states = pd.DataFrame({'population': population,
                       'area': area})
print(states)

In [None]:
#create a DataFrame from a list of dictionaries
data = [{'a': i, 'b': 2 * i} for i in range(3)]   # remember list comprehensions?
print("\nList of Dictionaries:")
print(data)
df = pd.DataFrame(data)
print("\nDataFrame:")
print(df)

In [None]:
#create a DataFrame from a 2D array
import numpy as np # now we need numpy!

df = pd.DataFrame(np.random.rand(3, 2),
      columns=['foo', 'bar'],
      index=['a', 'b', 'c'])
print(df)

#if the names are omitted, an integer index will be used for each
print("\ndata frame with omitted index names")
df = pd.DataFrame(np.random.rand(3, 2))
print(df)

## Try It!

Create a pandas DataFrame from the following dictionary that represents students' scores in different subjects:

```
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [85, 90, 78],
    'Science': [92, 88, 95],
    'English': [87, 85, 90]
}
```

Sample Output

```
      Name  Math  Science  English
0    Alice    85       92       87
1      Bob    90       88       85
2  Charlie    78       95       90
```

### DataFrame attributes

- DataFrames have an **index** attribute (the row labels) and a **columns** attribute (the column labels)

In [None]:
print("States DataFrame:")
print(states)

#index refers to the row headings
print("\nIndex:")
print(states.index)

print("\nColumns:")
print(states.columns)

#indexing a DataFrame with a column name returns that column as a Series
print("\nColumn Values as Indices")
states['area']

## Missing Values
- Missing values are filled with NaN ("not-a-number")
- This behavior is important; in data science missing values can impact analytical results and should be dealt with consistently
- https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b


In [None]:
print("List of Dictionaries:")
data=[{'a': 1, 'b': 2},
      {'b': 3, 'c': 4}]
print(data)

print("\nDataFrame with NaNs")
# pandas aligns the keys to form columns: a, b, and c.
# The first dictionary has keys a and b, so it has NaN for column c.
# The second dictionary has keys b and c, so it has NaN for column a.
print(pd.DataFrame(data))

# Series as a Dictionary
- Like a dictionary, the series object provides a mapping from a collection of keys to a collection of values

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index=['a', 'b', 'c', 'd'])
print(data)
print(data['b'])

### We can use dictionary-like expressions and methods to examine keys/indices and values

In [None]:
#membership tests check the index (the keys), not the values
print('a' in data)
print(data.keys())
print(list(data.items()))

### Series objects can even be modified with a dictionary-like syntax
- Just as a dictionary can be extended by assigning to a new key, a Series can be extended by assigning a value to a new index label

In [None]:
print(data)
data['e'] = 1.25
print(data)

# Series as a One-dimensional Array
- A Series provides array-style item selection using the same basic mechanisms as NumPy arrays, including slices and masks

In [None]:
#slicing by explicit index
#NOTE: includes last index!!

print(data['a':'c'])

#slicing by implicit integer index
#NOTE: does NOT include the last index!

print(data[0:2])

#masking
print(data[(data > 0.3) & (data < 0.8)])
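
Because the two slicing styles above treat the end point differently, it can help to state which one is intended. The sketch below uses the `loc` (label-based) and `iloc` (position-based) indexers, which this lesson has not introduced; take it as an optional illustration rather than part of the exercise.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# label-based slice: the end label 'c' is included
by_label = data.loc['a':'c']
print(by_label)

# position-based slice: the end position 2 is excluded
by_position = data.iloc[0:2]
print(by_position)
```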

# DataFrame as Dictionary
- The individual series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':population})
print(data)
print(data['area'])

In [None]:
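#we can use attribute-style access with column names that are strings,
#as long as the name does not clash with a DataFrame method
#(data.pop is the pop method, not the 'pop' column)

print(data.area)

#use the is operator to check whether both expressions return the same object
#(the result may differ between pandas versions)

print(data.area is data['area'])

#the pop method removes a column from the DataFrame and returns it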
print(data.pop('area'))

print(data.columns)

### Dictionary-style syntax can also be used to modify the DataFrame

In [None]:
#first, add back the column that was popped previously

data['area'] = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
print(data)

#compute density and add the density column to the DataFrame

data['density'] = data['pop'] / data['area']

print(data)

# DataFrame as a Two-dimensional (2D) array
- The DataFrame can be viewed as an enhanced two-dimensional array
 - View the underlying data array using the ndarray **values** attribute

In [None]:
data.values

- The **T** attribute in a pandas DataFrame is used to transpose the DataFrame, swapping its rows and columns.


In [None]:
data.T

# Handling Missing Values
- Real-world data is rarely clean and homogeneous
 - Many datasets will have missing data
- Different data sources may indicate missing data in different ways
 - Using a mask
  - as a separate Boolean array
  - as a single bit in the data representation
 - Using a sentinel value
  - a data-specific convention, e.g. for missing integers use -9999 or some rare bit pattern
  - NaN (Not a Number) for missing floating point values
   - NaN is part of the IEEE floating-point specification
- No universally common choice exists; different languages and systems use different conventions.
- Pandas uses sentinels for missing data, using two already-existing Python null values: the special floating-point NaN value, and the Python None object
 - This results in some side effects, but in practice is a good compromise.
- None is a "singleton" object: only one instance can exist.
 - "The None keyword is used to define a null value, or no value at all. None is not the same as 0, False, or an empty string. None is a datatype of its own (NoneType) and only None can be None" (https://www.w3schools.com/python/ref_keyword_none.asp)
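
The mask and sentinel conventions listed above can be sketched with plain NumPy arrays. The readings below are hypothetical, and `is_missing` and `SENTINEL` are names made up for this illustration; the point is only that both conventions need extra bookkeeping before a statistic can be computed.

```python
import numpy as np

# hypothetical readings in which the third value was never recorded

# mask approach: a separate Boolean array flags the missing entry
readings = np.array([12, 15, 0, 18])
is_missing = np.array([False, False, True, False])
mean_masked = readings[~is_missing].mean()

# sentinel approach: a reserved value stands in for the missing entry
SENTINEL = -9999
readings_sentinel = np.array([12, 15, SENTINEL, 18])
mean_sentinel = readings_sentinel[readings_sentinel != SENTINEL].mean()

print(mean_masked, mean_sentinel)
```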


# Using None in NumPy
- Because None is a Python object, it cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects).


In [None]:
vals1 = np.array([1, None, 3, 4])
print(vals1)
print(vals1.dtype)

- dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
- Object arrays can be useful, but operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types
 - If you perform aggregations like sum() or min() across an array with a None value, you will generally get an error since operations between numbers and None are undefined
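
As a small sketch of the error mentioned above, the cell below wraps the aggregation in `try`/`except` so the notebook keeps running. The variable name `error_message` is introduced here for illustration.

```python
import numpy as np

vals = np.array([1, None, 3, 4])   # dtype is object because of None

error_message = None
try:
    vals.sum()
except TypeError as err:
    # adding an integer to None is undefined, so Python raises TypeError
    error_message = str(err)
    print("sum() failed:", error_message)
```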


# Using NaN in NumPy
- The other missing data representation, NaN (short for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
 - NaN is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types.
- NumPy uses a native floating-point type for arrays containing NaN.
 - Unlike the object array from before, this array supports fast operations pushed into compiled code.


In [None]:
vals2 = np.array([1, np.nan, 3, 4])
print(vals2)
print(vals2.dtype)

- "NaN is a bit like a data virus—it infects any other object it touches."
 - [VanderPlas, Python Data Science Handbook]
- NaN values will propagate through numeric operations
 - The result of arithmetic with a NaN will be another NaN


In [None]:
print(1 + np.nan)
print(0 *  np.nan)

- Using aggregate functions with NaN values does not result in errors, but the results aren't very useful


In [None]:
print(vals2)
print(vals2.sum())
print(vals2.min())
print(vals2.max())

- NumPy provides special aggregations that will ignore these missing values:


In [None]:
print(vals2)
print(np.nansum(vals2))
print(np.nanmin(vals2))
print(np.nanmax(vals2))

# NaN and None in Pandas
- Pandas is built to handle NaN and None nearly interchangeably, converting between them where appropriate



In [None]:
print(pd.Series([1, np.nan, 2, None]))
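
One practical consequence, sketched below with the same values: both NaN and None are reported as missing, and pandas aggregations such as `sum()` skip them by default, unlike the plain NumPy aggregations shown earlier. The `skipna` argument switches that behavior off.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None])

print(s.isnull().tolist())     # both NaN and None are reported as missing
print(s.sum())                 # missing values are skipped by default
print(s.sum(skipna=False))     # include them and the result is NaN
```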

- When integer values are set to NaN or None, the data type is automatically up-cast to floating point
 - None is automatically converted to a NaN value


In [None]:
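x = pd.Series(range(2), dtype=int)
print(x)
# NOTE: recent pandas versions may warn about (or reject) this assignment
# because it changes the dtype of an existing Series
x[0] = None
print(x)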

# Operating on Null Values
- There are several methods for detecting, removing, and replacing null values in Pandas data structures.
 - isnull(): generates a Boolean mask indicating missing values
 - notnull(): the opposite of isnull()
 - dropna(): returns a filtered version of the data
 - fillna(): returns a copy of the data with missing values filled or imputed


# Detecting Null Values
- Using isnull()

In [None]:
data = pd.Series([1, np.nan, 'hello', None])
print(data)
print(data.isnull())

- Boolean masks can be used to index Series or DataFrames

In [None]:
print(data.notnull())
print(data[data.notnull()])

# Dropping Null Values in Series
- The dropna() method returns a new Series without the nulls

In [None]:
print(data)
print(data.dropna())

- For a DataFrame, dropna() by default drops every row that contains any null value


In [None]:
df = pd.DataFrame([ [1,      np.nan, 2],
                    [2,      3,      5],
                    [np.nan, 4,      6]])
print(df)
print(df.dropna())

- axis='columns' drops all columns containing a null value


In [None]:
print(df)
df.dropna(axis='columns')

- The thresh argument specifies the minimum number of non-null values a row (or column) must contain to be kept

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [8,      np.nan, np.nan],
                   [np.nan, np.nan, np.nan]])
print(df)
print(df.dropna(thresh=2))
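
A related option, not covered above, is `how='all'`, which drops a row or column only when every value in it is null. The sketch below reuses the same DataFrame; `rows_kept` and `cols_kept` are names chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [8,      np.nan, np.nan],
                   [np.nan, np.nan, np.nan]])

# drop only the rows in which every value is null (row 2)
rows_kept = df.dropna(how='all')
print(rows_kept)

# drop only the columns in which every value is null (column 1)
cols_kept = df.dropna(axis='columns', how='all')
print(cols_kept)
```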

# Filling Null Values in Series
- Pandas provides the fillna() method to replace the null values in a Series
 - fillna() returns a copy of the Series with the null values replaced


In [None]:
data = pd.Series([1, np.nan, 2, None, 3],
                 index=list('abcde'))
print(data)
print("\nFillNA:")
print(data.fillna(0))
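
The fill value does not have to be a constant. As one simple form of imputation, the sketch below fills the gaps with the mean of the values that are present; whether the mean is an appropriate stand-in depends on the data, so treat this as an illustration only. The names `mean_value` and `filled` are introduced here.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# mean() skips the missing values, so this is the mean of 1, 2 and 3
mean_value = data.mean()

filled = data.fillna(mean_value)
print(filled)
```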

- Missing values can also be filled from neighboring valid values
 - ffill ("forward fill") propagates the last valid observation forward until the next valid one
 - bfill ("backward fill") propagates the next valid observation backward until the previous valid one
 - These are available as the ffill() and bfill() methods; older code passes method='ffill' or method='bfill' to fillna(), which recent pandas versions deprecate

In [None]:
print(data)
print("\nForward Fill")
# ffill() replaces fillna(method='ffill'), which newer pandas deprecates
print(data.ffill())
print("\nBack Fill")
# bfill() replaces fillna(method='bfill')
print(data.bfill())

- fillna() options for DataFrames are similar to those for Series
- An axis for fills can be specified
 - Choices are axis=0 to fill down each column, along the index (this is the default), and axis=1 to fill across each row, along the columns
 - A NaN with no valid value before it along the fill direction is left unfilled


In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [8,      np.nan, np.nan],
                   [np.nan, 4,      6]])
print("\nDataFrame:")
print(df)

print("\nIndex/Row fill")
# in column 0, the NaN at index 3 is filled with 8.0
# (the last valid observation before it).
print(df.ffill(axis=0))   # index fill

print("\nColumn fill")
# in the first row, the NaN in column 1 is filled with 1.0
# (the last valid observation before it in the same row).
print(df.ffill(axis=1))  # column fill
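
The case where the first value is NaN can be seen on a small Series. In this sketch, a forward fill leaves the leading NaN in place because nothing precedes it; chaining a back fill afterwards is one possible way to cover it, shown here as an option rather than a recommendation.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, 2])

# forward fill: the leading NaN has no earlier value to copy
forward = s.ffill()
print(forward)

# forward fill, then back fill to cover the leading NaN
both = s.ffill().bfill()
print(both)
```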