## Python Data Science Methodology

Access - Identify the analysis tables that will be used and load those tables.<br> 
Explore/Investigate - Inspect tables to determine whether any changes are needed for data items due to data inconsistencies or data quality issues, and identify any new data items that need to be calculated. <br>
Prepare - Correct any data quality issues and create any new calculated items needed for analysis. <br>
Analyze - Explore data to identify any patterns, relationships, and trends. <br>
Report - Develop interactive reports that can be shared via the web or a mobile device.

### Importing an image

In [1]:
from IPython.display import display, HTML
 
image_url = "https://www.canr.msu.edu/home_gardening/uploads/images/Photo1-Rusty.jpg?language_id=1"
html_code = f'<img src="{image_url}" width="300">'
 
display(HTML(html_code))

Bumble bees are important pollinators of wild flowering plants and agricultural crops. They are able to fly in cooler temperatures and lower light levels than many other bees, making them excellent pollinators—especially at higher elevations and latitudes. They also perform a behavior called “buzz pollination,” in which the bee grabs the flower in her jaws and vibrates her wing muscles to dislodge pollen from the flower. Many plants, including a number of wildflowers and crops like tomatoes, peppers, and cranberries, benefit from buzz pollination.

Because they are essential pollinators, loss of bumble bees can have far-ranging ecological consequences. Alarmingly, recent work by the Xerces Society in concert with the IUCN Bumble Bee Specialist Group indicates that some species have experienced more rapid and dramatic declines than others. In fact, more than one quarter (28%) of all North American bumble bees are facing some degree of extinction risk. While some species have received considerable conservation attention, other species such as the Suckley cuckoo bumble bee and the variable cuckoo bumble bee have been largely overlooked.

For information about our efforts to conserve the rusty patched bumble bee (Bombus affinis), please see its <a href='https://www.xerces.org/rusty-patched-bumble-bee/'>profile page</a> and check out this <a href='https://storymaps.arcgis.com/stories/c5e591a19eb24d28af483ede7b174434'>story map</a>.

## Data Access Python

#### Reading a CSV File into a DataFrame

Reading a CSV file into a DataFrame is the first step in data analysis using pandas. This task involves loading data from a CSV file into a pandas DataFrame, which provides a powerful and flexible data structure for data manipulation and analysis. The read_csv function is used to read the CSV file, making the data easily accessible for various operations such as filtering, grouping, and aggregating.

In [1]:
import pandas as pd

In [5]:
# Read the North American bumblebee CSV file into a DataFrame for easy data manipulation and analysis.
df1=pd.read_csv('/workspaces/myfolder/SASPythonDataScientists/pattern_decline__N_American_Bumblebees.csv' , encoding='latin-1')


In [None]:
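# encoding='latin-1' avoids "UnicodeDecodeError: invalid continuation byte", which is raised when the file is not valid UTF-8; see
# https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte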
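
The encoding note above can be reproduced on a tiny file. The following is a minimal sketch, not the notebook's own data: the file name, the column names, and the single row are made up for illustration, and the try/except fallback is just one possible way to handle the error.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV encoded as Latin-1 (hypothetical file and contents).
# The byte for "é" in Latin-1 is not a valid UTF-8 sequence in this position.
tmp_path = os.path.join(tempfile.mkdtemp(), "sample_latin1.csv")
with open(tmp_path, "w", encoding="latin-1", newline="") as f:
    f.write("recordedBy,country\nRené,Mexico\n")

try:
    sample = pd.read_csv(tmp_path)  # pandas assumes UTF-8 by default
except UnicodeDecodeError:
    # Latin-1 maps every possible byte to a character, so this read cannot fail on decoding
    sample = pd.read_csv(tmp_path, encoding="latin-1")

print(sample)
```

Because Latin-1 accepts any byte, it never raises a decoding error, but it will silently produce wrong characters if the file is really in some other encoding, so it is worth checking a few text values after loading.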

In [7]:
# Read the Mexican bumblebee CSV file into a DataFrame for easy data manipulation and analysis.
df2=pd.read_csv('/workspaces/myfolder/SASPythonDataScientists/pattern_decline__Mexican_Bumblebees.csv' , encoding='latin-1')

#### Data Interoperability

In [None]:
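# pd.read_sas needs the path to a SAS data set (.sas7bdat or .xpt); the path below is a placeholder
df=pd.read_sas('path/to/your_table.sas7bdat')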

In [None]:
df1.describe()

## Data Exploration

#### Get to know Python metadata

In [None]:
df1.describe()

In [None]:
print(df1.dtypes)

In [None]:
df1.dtypes

##### Concatenating two DataFrames to combine North American (excluding Alaska) and Mexican bumblebees

##### For a good explanation, see https://realpython.com/pandas-merge-join-and-concat/ and https://pandas.pydata.org/docs/user_guide/merging.html

In [8]:
dfbig=pd.concat([df1,df2])
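
As a small illustration of what `pd.concat` does with row labels, here is a sketch on two made-up tables (the `north` and `mexico` frames and their values are hypothetical, not the bumblebee files). By default each table keeps its own index, so labels repeat; `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Two tiny stand-in tables with the same columns (made-up values)
north = pd.DataFrame({"id": [1, 2], "country": ["United States", "Canada"]})
mexico = pd.DataFrame({"id": [3, 4], "country": ["Mexico", "Mexico"]})

# Default behaviour: each table keeps its original row labels
stacked = pd.concat([north, mexico])

# ignore_index=True builds a fresh 0..n-1 index for the combined table
renumbered = pd.concat([north, mexico], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels matter later if rows are selected with `.loc`, since one label can then return several rows.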

In [9]:
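# Without parentheses this shows the bound method together with a preview of the DataFrame; call dfbig.describe() for summary statistics
dfbig.describe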

<bound method NDFrame.describe of            id institutionCode collectionCode      basisOfRecord  occurrenceID  \
0   699384987        USDA-ARS           BBSL  PreservedSpecimen     699384987   
1   699384988        USDA-ARS           BBSL  PreservedSpecimen     699384988   
2   699384989        USDA-ARS           BBSL  PreservedSpecimen     699384989   
3   699384990        USDA-ARS           BBSL  PreservedSpecimen     699384990   
4   699384991        USDA-ARS           BBSL  PreservedSpecimen     699384991   
..        ...             ...            ...                ...           ...   
19  767112267        USDA-ARS           BBSL  PreservedSpecimen     767112267   
20  767113738        USDA-ARS           BBSL  PreservedSpecimen     767113738   
21  767135957        USDA-ARS           BBSL  PreservedSpecimen     767135957   
22  767136283        USDA-ARS           BBSL  PreservedSpecimen     767136283   
23  767140079        USDA-ARS           BBSL  PreservedSpecimen     7671400

#### Drop columns from pandas dataframe

###### You can find the name of the first column with df.columns[0]. Indexing in Python starts from 0.

In [None]:
# Drop the columns at positions 2 through 11 (positions are zero-based)
cols=[2,3,4,5,6,7,8,9,10,11]
df = dfbig.drop(dfbig.columns[cols], axis=1)

#### Get to know data types

In [None]:
print(dfbig.dtypes)

When you import data into a pandas DataFrame, pandas by default tries to infer the data type of each column. Columns with text are by default marked as object dtype.

But the object dtype has a much broader scope. It can include not only strings, but also any other data that pandas doesn't understand.

Since pandas 1.0, there's a dedicated dtype to handle and work with text data: string (StringDtype).🤔

How is this important?

When a column is Object type, it does not necessarily mean that all the values will be string.

In fact, they can all be numbers, or a mixture of string, integers and floats.

With this discrepancy present, you cannot perform string operations on the column straightaway.

Moreover, with dtype object it is harder to work with just the text and exclude the non-text values.

With the new String dtype, the values are explicitly treated as strings.

Convert the DataFrame to use best possible dtypes.

In [None]:
dfconv = dfbig.convert_dtypes()
dfconv.dtypes
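
To see the effect of `convert_dtypes()` in isolation, here is a minimal sketch on a made-up table (the `toy` frame and its values are assumptions for illustration). A text column with a missing value and a whole-number column with a missing value are both converted to nullable dtypes.

```python
import pandas as pd

# A tiny stand-in table with missing values (made-up data)
toy = pd.DataFrame({
    "scientificName": ["Bombus affinis", "Bombus borealis", None],
    "year": [1970, 1985, None],  # the missing value forces a float column on import
})
print(toy.dtypes)

# convert_dtypes picks nullable dtypes that can hold missing values
converted = toy.convert_dtypes()
print(converted.dtypes)

# String methods are now applied to a column that is explicitly text
print(converted["scientificName"].str.upper())
```

The year column is a useful case: whole numbers with a gap are usually read in as floats, and the nullable integer dtype lets them be stored as integers again while keeping the gap.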

#### Request descriptive statistics

In [None]:
print(dfconv.describe())

#### Request first 5 rows of data

In [None]:
print(dfconv.head())

In [None]:
print(dfconv.tail())

## Pattern Matching in Python

In [None]:
import re

## Filtering Rows Based on a Condition

Filtering rows based on a condition is a common data management task that allows you to focus on a specific subset of your data. By applying a condition to a column, such as selecting rows where the pollinator_genus is Bombus, you can isolate and analyze the data that meets your criteria. This helps in drawing insights and making data-driven decisions based on relevant data subsets.

In [None]:
# Filter rows where pollinator_genus is "Bombus"
filtered_df = df.query('pollinator_genus == "Bombus"')
print(filtered_df)

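print creates basic-looking output; in a notebook, use df.head() etc. as the last line of a cell instead.
Python is object oriented, so every line of code works with an object.
The following cells use indexing, where df is the DataFrame. Positions are zero-based and go through .iloc: df.iloc[0] selects the first row, df.iloc[0, 1] selects the first row, second column, and df.iloc[0:10, 2] selects the first ten rows of the third column. Square brackets directly on df select columns or, with a boolean condition, filter rows.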
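
The indexing notes above can be tried on a small table. This is a sketch with made-up values (the `toy` frame is hypothetical); it contrasts position-based selection through `.iloc` with square brackets on the DataFrame itself.

```python
import pandas as pd

# A tiny stand-in table (made-up data)
toy = pd.DataFrame({
    "pollinator_genus": ["Bombus", "Apis", "Bombus"],
    "year": [1970, 1985, 2001],
    "country": ["Mexico", "Canada", "Mexico"],
})

first_row = toy.iloc[0]           # first row, returned as a Series
cell = toy.iloc[0, 1]             # first row, second column
block = toy.iloc[0:2, 2]          # first two rows of the third column
column = toy["pollinator_genus"]  # square brackets with a label select a column

# Square brackets with a boolean condition keep only the matching rows
bombus_rows = toy[toy["pollinator_genus"] == "Bombus"]
print(bombus_rows)
```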

In [None]:
df[df['pollinator_genus'] == 'Bombus']

In [None]:
df['pollinator_genus'] == 'Bombus'

In [None]:
sum(df['pollinator_genus'] == 'Bombus')


In [None]:
sum(~(df['pollinator_genus'] == 'Bombus'))

In [None]:
sum((df['pollinator_genus'] != 'Bombus'))

## Handling Missing Values: Fill missing values with the column mean

Handling missing values is crucial for maintaining data integrity and ensuring accurate analysis. Missing data can be filled using various methods, such as replacing them with the column mean. This task uses the fillna function to fill any NaN (Not a Number) values in the numeric columns of the DataFrame with the mean of their respective columns, so that later calculations are not interrupted by missing values. Mean imputation can itself distort a column's distribution, so treat it as a simple default rather than a universal fix.

In [None]:
# numeric_only=True skips text columns, whose mean is undefined
df.fillna(df.mean(numeric_only=True), inplace=True)
print(df)
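
Here is a minimal sketch of mean filling on a made-up table (the `toy` frame and its values are assumptions for illustration). It shows that only the numeric columns are filled, while a missing text value is left as it is.

```python
import numpy as np
import pandas as pd

# A tiny stand-in table with gaps in numeric and text columns (made-up data)
toy = pd.DataFrame({
    "year": [1970.0, np.nan, 1990.0],
    "month": [7.0, 8.0, np.nan],
    "country": ["Mexico", None, "Canada"],
})

# One mean per numeric column; text columns are skipped
column_means = toy.mean(numeric_only=True)

# fillna matches each mean to its column by name
filled = toy.fillna(column_means)
print(filled)
```

Returning a new DataFrame instead of using `inplace=True` keeps the original table available for comparison.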

## Applying a Function to Each Column

Applying a function to each column allows you to perform element-wise operations across the DataFrame. This task involves using the apply function with a lambda function to modify the values in each column. For example, multiplying each element by 2. This technique is useful for standardizing data, performing calculations, and transforming data values as needed.

In [None]:
# Apply a lambda function to each column
df = df.apply(lambda x: x*2)
print(df)
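
When a table mixes text and numbers, applying an arithmetic function to every column is rarely what is wanted; with object-dtype text, multiplying by 2 repeats each string. The sketch below, on a made-up table (the `toy` frame and its column names are hypothetical), restricts the function to the numeric columns.

```python
import pandas as pd

# A tiny stand-in table with one text and two numeric columns (made-up data)
toy = pd.DataFrame({
    "scientificname": ["Bombus affinis", "Bombus borealis"],
    "count": [3, 5],
    "elevation": [120.0, 450.5],
})

# Pick out the numeric columns, then apply the function only to them
numeric_cols = toy.select_dtypes(include="number").columns
doubled = toy.copy()
doubled[numeric_cols] = toy[numeric_cols].apply(lambda col: col * 2)
print(doubled)
```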

## Exporting DataFrame to CSV

In [None]:
# Export the DataFrame to a CSV file
df.to_csv('cleaned_data.csv', index=False)
print("DataFrame exported to cleaned_data.csv")

## Grouping and Aggregating Data

In [None]:
# Grouping data and calculating aggregates
# Group by column 'country' and sum the numeric columns within each group
grouped_df = dfconv.groupby("country").sum(numeric_only=True)
print(grouped_df)

In [None]:
# Count the non-missing 'country' values within each country group
dfconv.groupby("country")["country"].count()

Count frequency values using GroupBy.size()
Use df.groupby(...).size() to get the frequency count of a single column or of a combination of columns: group by the column(s), then apply size() to the resulting GroupBy object.
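
A minimal sketch of frequency counts with `size()`, using a made-up table (the `toy` frame and its rows are hypothetical, not the real records): first for one column, then for a combination of two columns.

```python
import pandas as pd

# A tiny stand-in table (made-up data)
toy = pd.DataFrame({
    "country": ["United States", "United States", "Canada", "United States"],
    "scientificName": ["Bombus affinis", "Bombus affinis", "Bombus borealis", "Bombus citrinus"],
})

# Number of rows per country
by_country = toy.groupby("country").size()

# Number of rows per (country, scientificName) combination
by_pair = toy.groupby(["country", "scientificName"]).size()

print(by_country)
print(by_pair)
```

Grouping by several columns returns a Series with a multi-level index, so a single combination is looked up with a tuple such as `by_pair[("Canada", "Bombus borealis")]`.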

In [None]:
list(df1)

list(df1) returns the column names as a Python list.

In [None]:
df1.select_dtypes(include=['object']).columns

In [None]:
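# Frequency table for one column ('country' is used here as an example)
pd.crosstab(index=df1['country'], columns='number of observations')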

Frequency counts for all categorical variables (Ari, 20 Aug 24):
the for loop iterates through the categorical columns instead of looping over an index;
column takes each element of that list in turn;
display (rather than print) shows the result of running the pandas crosstab function.

In [None]:
tables=[]
for column in df1.select_dtypes(include=['object']).columns:
    tables.append(pd.crosstab(index=df1[column], columns='number of observations'))

In [None]:
tables

In [None]:
for column in df1.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=df1[column], columns='number of observations'))

In [None]:
for column in df1.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=df1[column], columns='% observations', normalize='columns')*100)

In [None]:
dfcount=df1.groupby(list(df1)).size()
print("counts",dfcount)

In [None]:
dfcount=df1.groupby(['country']).size()
print("counts",dfcount)

In [None]:
dfcount=dfconv.groupby(['country']).size()
print("counts",dfcount)

The nunique() method returns the number of unique values for each column.

By specifying the column axis (axis='columns'), the nunique() method searches column-wise and returns the number of unique values for each row.

Syntax
dataframe.nunique(axis, dropna)

In [None]:
dfconv.nunique('columns')
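
The difference between the two axes is easiest to see on a small table. This is a sketch with made-up values (the `toy` frame is hypothetical); as in the bumblebee data, `id` and `occurrenceID` hold the same number in each row.

```python
import pandas as pd

# A tiny stand-in table (made-up data)
toy = pd.DataFrame({
    "id": [10, 11, 12],
    "occurrenceID": [10, 11, 12],
    "year": [1970, 1970, 1985],
})

# Default axis: one count per column
per_column = toy.nunique()

# axis="columns": one count per row, comparing values across the columns
per_row = toy.nunique(axis="columns")

print(per_column)
print(per_row)
```

Each row contains only two distinct values because `id` and `occurrenceID` repeat each other, which is one way to spot redundant columns.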

In [None]:
# Mixed data types across columns are the problem here
dfconv.apply(lambda x: x.value_counts()).T.stack()

In [None]:
count = df1.groupby(['country', 'stateProvince', 'scientificName']).size() 
print(count) 

In [None]:
count = dfconv.groupby(['country', 'stateProvince', 'scientificName']).size() 
print(count) 

In [None]:
count = df['fruit'].value_counts()['apple']

print(f"The number of apples is: {count}")

## Merging data frames

We're diving into the world of bumblebees by buzzing through some data magic in Python! Imagine we've got one table that's packed with the scientific names of our favorite fuzzy pollinators, and another that's got their common names. By concatenating the scientific names into one tidy table and then merging it with the common names, we're basically creating the ultimate bee database—bringing together the formal and the familiar. It's like giving each bee its proper name tag at the hive party! This way, we can easily connect the dots between the Latin and the layman's terms, making our bumblebee data analysis as sweet as honey. 🐝💻

In [36]:
df3=pd.read_csv("/workspaces/myfolder/SASPythonDataScientists/Bumblebee_Scientific_Common_Names.csv", encoding='latin-1')

In [37]:
df3=pd.read_csv("https://raw.githubusercontent.com/CharuSAS/SASPythonDataScientists/main/Bumblebee_Scientific_Common_Names.csv", encoding='latin-1')

In [None]:
df3.describe()

Take a quick look at the dimensions of the tables we are about to merge

In [18]:
df3.shape

(55, 3)

When working with our two bumblebee tables, one buzzing with scientific names and the other humming with common names, the pandas merge() function is like a matchmaker for your data. The great thing about merge() is that it lets you decide exactly how these two tables come together. Say you want to merge them based on the ScientificName column, ensuring that each bee's formal identity pairs up perfectly with its everyday nickname. By using the on parameter, you can create the ultimate bee directory where the Latin meets the common, all while keeping your data as sharp as a bee's stinger! 🐝🔗

In [None]:
inner_merged = pd.merge(df1, df3, on=["SCIENTIFICNAME"])

Column names for DataFrames are case sensitive.

Dataframe column names are essentially string values, which are case sensitive in Python. Because of this, you will need to be careful whenever you utilize column names, such as when renaming a column, accessing columns or performing functions on them.
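
The effect of case sensitivity on a merge can be sketched with two made-up tables (the `records` and `names` frames and the exact spellings of their column names are assumptions for illustration). The first merge fails because the key is spelled differently in the two tables; after lowercasing the column names, the same merge works.

```python
import pandas as pd

# Two tiny stand-in tables whose key columns differ only in case (made-up data)
records = pd.DataFrame({
    "scientificName": ["Bombus affinis", "Bombus borealis"],
    "year": [1970, 1985],
})
names = pd.DataFrame({
    "ScientificName": ["Bombus affinis", "Bombus borealis"],
    "CommonName": ["Rusty Patched Bumblebee", "Northern Amber Bumblebee"],
})

try:
    pd.merge(records, names, on="scientificName")
except KeyError as err:
    # The key must exist, with identical spelling and case, in both tables
    print("merge failed, missing key:", err)

# Lowercase every column name in both tables, then merge on the shared key
records.columns = records.columns.str.lower()
names.columns = names.columns.str.lower()
merged = pd.merge(records, names, on="scientificname")
print(merged)
```

An alternative that leaves the column names untouched is to pass `left_on` and `right_on` with the two spellings.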

In [25]:
df1.columns = df1.columns.str.lower()

In [40]:
df3.columns = df3.columns.str.lower()

In [41]:
df1.describe

<bound method NDFrame.describe of               id institutioncode collectioncode      basisofrecord  \
0      699384987        USDA-ARS           BBSL  PreservedSpecimen   
1      699384988        USDA-ARS           BBSL  PreservedSpecimen   
2      699384989        USDA-ARS           BBSL  PreservedSpecimen   
3      699384990        USDA-ARS           BBSL  PreservedSpecimen   
4      699384991        USDA-ARS           BBSL  PreservedSpecimen   
...          ...             ...            ...                ...   
66902  767151731        USDA-ARS           BBSL  PreservedSpecimen   
66903  767151732        USDA-ARS           BBSL  PreservedSpecimen   
66904  767151733        USDA-ARS           BBSL  PreservedSpecimen   
66905  767151734        USDA-ARS           BBSL  PreservedSpecimen   
66906  767151735        USDA-ARS           BBSL  PreservedSpecimen   

       occurrenceid catalognumber                recordedby    year  month  \
0         699384987    BBSL221088              

In [42]:
df3.describe

<bound method NDFrame.describe of           scientificname                       commonname  \
0                 Bombus                        Bumblebee   
1         Bombus affinis          Rusty Patched Bumblebee   
2       Bombus appositus       White-shouldered Bumblebee   
3         Bombus ashtoni          Ashton Cuckoo Bumblebee   
4         Bombus atratus                  Black Bumblebee   
5       Bombus auricomus         Black and Gold Bumblebee   
6       Bombus balteatus          Golden-belted Bumblebee   
7        Bombus bifarius               Two-form Bumblebee   
8     Bombus bimaculatus            Two-spotted Bumblebee   
9       Bombus bohemicus           Gypsy Cuckoo Bumblebee   
10       Bombus borealis         Northern Amber Bumblebee   
11    Bombus caliginosus                Obscure Bumblebee   
12      Bombus centralis                Central Bumblebee   
13       Bombus citrinus           Lemon Cuckoo Bumblebee   
14     Bombus cockerelli            Cockerells Bum

In [None]:
inner_merged.shape

### Merge df1 and df3 on column scientificname

In [43]:
inner_merged = pd.merge(df1, df3, on=["scientificname"])
print(inner_merged)

              id institutioncode collectioncode      basisofrecord  \
0      699384987        USDA-ARS           BBSL  PreservedSpecimen   
1      699384988        USDA-ARS           BBSL  PreservedSpecimen   
2      699384989        USDA-ARS           BBSL  PreservedSpecimen   
3      699384990        USDA-ARS           BBSL  PreservedSpecimen   
4      699384991        USDA-ARS           BBSL  PreservedSpecimen   
...          ...             ...            ...                ...   
66902  767151731        USDA-ARS           BBSL  PreservedSpecimen   
66903  767151732        USDA-ARS           BBSL  PreservedSpecimen   
66904  767151733        USDA-ARS           BBSL  PreservedSpecimen   
66905  767151734        USDA-ARS           BBSL  PreservedSpecimen   
66906  767151735        USDA-ARS           BBSL  PreservedSpecimen   

       occurrenceid catalognumber                recordedby    year  month  \
0         699384987    BBSL221088               W. Apperson  1970.0    7.0   
1  

In [None]:
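# Use the raw.githubusercontent.com address; the github.com/.../blob/... page is HTML, not CSV
pd.read_csv("https://raw.githubusercontent.com/CharuSAS/SASPythonDataScientists/main/bumblebee%20scientific%20and%20common%20name.csv")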

## Pivoting data frames

In [None]:
# Pivot the DataFrame on 'index', 'columns', and 'values'
pivot_df = df.pivot(index='date', columns='category', values='value')
print(pivot_df)
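
A minimal sketch of `pivot` on made-up long-format data (the `long_df` frame and its values are hypothetical): each distinct `category` becomes a column, each `date` becomes a row, and `value` fills the cells.

```python
import pandas as pd

# Long format: one row per (date, category) pair (made-up data)
long_df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "category": ["A", "B", "A", "B"],
    "value": [10, 20, 30, 40],
})

# Wide format: one row per date, one column per category
wide_df = long_df.pivot(index="date", columns="category", values="value")
print(wide_df)
```

`pivot` raises an error if the same (date, category) pair appears more than once; in that case `pivot_table`, which aggregates the duplicates, is the usual alternative.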

## Resampling Time Series Data

In [None]:
# Resample time series data to monthly frequency ('M' means month end; pandas 2.2 and later prefer the alias 'ME')
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
monthly_df = df.resample('M').sum()
print(monthly_df)
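
Here is a small sketch of monthly resampling on made-up dates (the `daily` frame and its values are assumptions). It uses the month-start alias `"MS"` rather than the month-end alias, purely so that the same line runs on older and newer pandas versions; the rows are labelled with the first day of each month.

```python
import pandas as pd

# A tiny stand-in series of dated values (made-up data)
daily = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-20", "2024-02-03"],
    "value": [1, 2, 5],
})

# resample needs a datetime index
daily["date"] = pd.to_datetime(daily["date"])
monthly = daily.set_index("date").resample("MS").sum()
print(monthly)
```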

## Removing Duplicate Rows

In [None]:

# Remove duplicate rows based on all columns
df.drop_duplicates(inplace=True)
print(df)
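
Before dropping duplicates it can help to count them, and sometimes only a subset of columns should define what counts as a duplicate. The sketch below uses a made-up table (the `toy` frame is hypothetical) to show both.

```python
import pandas as pd

# A tiny stand-in table with one fully repeated row (made-up data)
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "scientificname": ["Bombus affinis", "Bombus affinis", "Bombus borealis", "Bombus affinis"],
})

# Count rows that repeat an earlier row in every column
n_duplicates = int(toy.duplicated().sum())

# Drop rows that are identical across all columns
deduped = toy.drop_duplicates()

# Keep only the first row for each scientific name
by_name = toy.drop_duplicates(subset="scientificname", keep="first")

print(n_duplicates)
print(deduped)
print(by_name)
```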

## Saving imported file to workbench

In [None]:
import requests

# File path and name
file_path = r"/workspaces/myfolder/MachineLearning/hmeq.csv"
 
# Specify the URL of the CSV file
url = r"https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/hmeq.csv"
 
# Download the CSV file and save it to Workbench
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
with open(file_path, 'wb') as f:
    f.write(response.content)
print(f'File downloaded: {file_path}')