In [1]:
# 1. Pick one of the datasets from the ChatBot session(s) of the TUT demo (or from your own ChatBot session if you wish) and use the code produced through the ChatBot interactions to import the data and confirm that the dataset has missing values

import pandas as pd

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv"
df = pd.read_csv(url)
print(df.isna().sum())

row_n           0
id              1
name            0
gender          0
species         0
birthday        0
personality     0
song           11
phrase          0
full_id         0
url             0
dtype: int64
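
The same check can be tried without a network connection. The sketch below is a minimal example on a small hypothetical frame (the column values are invented for illustration and are not taken from villagers.csv); it counts missing values per column and tests whether any cell is missing at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up, not taken from villagers.csv
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "song": ["Song 1", np.nan, "Song 2"],
})

missing_per_column = toy.isna().sum()        # number of missing values in each column
has_missing = bool(toy.isna().any().any())   # True if any single cell is missing

print(missing_per_column)
print("Any missing values?", has_missing)
```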


In [2]:
# 2. Start a new ChatBot session with an initial prompt introducing the dataset you're using and request help to determine how many columns and rows of data a pandas DataFrame has, and then
# How do I find the number of rows and columns in a pandas DataFrame?

num_rows, num_cols = df.shape
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)

# 2.2
# Observations are the individual entries (rows) of a dataset; each row describes one subject.
# For example, in a business's customer dataset, one observation would be an individual customer.

# Variables are the characteristics recorded for each entry (columns); each column holds one attribute across all entries.
# For example, that same business might store each customer's AGE or NAME as variables.

Number of rows: 391
Number of columns: 11
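
To make the row/column distinction concrete, here is a minimal sketch using a hypothetical customer table (the names and ages are invented for illustration). It pulls out one observation and one variable from the same frame.

```python
import pandas as pd

# Hypothetical customer records: each row is one observation (a customer),
# each column is one variable (an attribute recorded for every customer).
customers = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [34, 51, 27],
})

n_observations, n_variables = customers.shape
first_observation = customers.iloc[0]   # one row: everything known about one customer
age_variable = customers["age"]         # one column: the age of every customer

print(n_observations, n_variables)
print(first_observation)
print(age_variable)
```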


In [4]:
# 3. Ask the ChatBot how you can provide simple summaries of the columns in the dataset and use the suggested code to provide these summaries for your dataset
# How can I use df.describe() and df['column'].value_counts() to summarize the columns of a given dataset?

print(df.describe())
print(df.describe(include='object'))
print("Column names:", list(df.columns))

for column in df.columns:
    print(df[column].value_counts())

print()

print("Missing values per column:")
missing_data_per_column = df.isna().sum(axis=0)  # axis=0 sums down each column
print(missing_data_per_column)

            row_n
count  391.000000
mean   239.902813
std    140.702672
min      2.000000
25%    117.500000
50%    240.000000
75%    363.500000
max    483.000000
             id     name gender species birthday personality          song  \
count       390      391    391     391      391         391           380   
unique      390      391      2      35      361           8            92   
top     admiral  Admiral   male     cat     1-27        lazy  K.K. Country   
freq          1        1    204      23        2          60            10   

         phrase           full_id  \
count       391               391   
unique      388               391   
top     wee one  villager-admiral   
freq          2                 1   

                                                      url  
count                                                 391  
unique                                                391  
top     https://villagerdb.com/images/villagers/thumb/...  
freq                 
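
One detail worth knowing about value_counts() is that it leaves out missing values unless asked otherwise. The sketch below assumes a small hypothetical Series (the values are placeholders) and compares the default behaviour with dropna=False.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
songs = pd.Series(["a", "a", np.nan, "b"], name="song")

counts_default = songs.value_counts()               # missing values are excluded
counts_with_nan = songs.value_counts(dropna=False)  # missing values get their own row

print(counts_default)
print(counts_with_nan)
```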

In [5]:
# What do df.shape and df.describe() do, and what are the differences between them?
print(df.shape)
print(df['song'].describe())
# df.shape vs df.describe()
# 391 != 380: df.shape reports 391 rows, but describe() counts only the 380 non-missing 'song' values

# df.describe(): summarizes only numeric columns by default; pass include='object' or include='all' to get unique counts, most frequent values, and their frequencies.
# df.describe(): its 'count' excludes missing (null) values
# df.shape: counts every row and column, regardless of dtype or missing values

# Switching datasets
new_url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
new_df = pd.read_csv(new_url)


(391, 11)
count              380
unique              92
top       K.K. Country
freq                10
Name: song, dtype: object
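
The gap between the row count from df.shape and the 'count' from describe() is exactly the number of missing values. A minimal sketch on a hypothetical four-row frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical frame with one missing 'song' value
toy = pd.DataFrame({"song": ["x", None, "y", "x"], "row_n": [1, 2, 3, 4]})

n_rows = toy.shape[0]                              # all rows, missing or not
n_non_missing = toy["song"].describe()["count"]    # non-missing values only
n_missing = toy["song"].isna().sum()

print(n_rows, n_non_missing, n_missing)
print(n_rows - n_non_missing == n_missing)
```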


In [None]:
# 5. Use your ChatBot session to help understand the difference between the following and then provide your own paraphrasing summarization of that difference
# What are the key differences between an attribute such as df.shape and a method such as df.describe(), which is called with parentheses?

# An attribute such as df.shape is a stored property of an object, in this case df.
# It is accessed without parentheses and usually provides information about the object itself, such as its dimensions.

# A method is a function attached to an object; it is called with parentheses, which hold its (sometimes optional) arguments.
# df.describe() runs code that computes summary statistics for df, with optional arguments passed inside the parentheses.
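
A short sketch of the difference, using a hypothetical one-column frame: the attribute is read directly, the method has to be called, and leaving the parentheses off a method returns the method object rather than its result.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2, 3]})   # hypothetical example data

dims = toy.shape               # attribute: read directly, no parentheses
summary = toy.describe()       # method: the call computes and returns a result
method_itself = toy.describe   # no parentheses: just the method, nothing is computed

print(dims)
print(summary)
print(callable(dims), callable(method_itself))
```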

In [6]:
# 6. The df.describe() method provides the 'count', 'mean', 'std', 'min', '25%', '50%', '75%', and 'max' summary statistics for each variable it analyzes. Give the definitions (perhaps using help from the ChatBot if needed) of each of these summary statistics

# Sample DataFrame
sample_df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
})

# count: The number of non-null (non-missing) values in each column.
print("Count:")
print(sample_df.count())
# Output:
# A    5
# B    5
# dtype: int64

# mean: The average value of the data in each column.
print("\nMean:")
print(sample_df.mean())
# Output:
# A     3.0
# B    30.0
# dtype: float64

# std: The sample standard deviation (pandas divides by n - 1 by default) of the data in each column.
print("\nStandard Deviation:")
print(sample_df.std())
# Output:
# A     1.581139
# B    15.811388
# dtype: float64

# min: The minimum value in each column.
print("\nMinimum:")
print(sample_df.min())
# Output:
# A     1
# B    10
# dtype: int64

# 25%: The 25th percentile (first quartile) value in each column.
print("\n25th Percentile:")
print(sample_df.quantile(0.25))
# Output:
# A     2.0
# B    20.0
# Name: 0.25, dtype: float64

# 50%: The 50th percentile (median) value in each column.
print("\n50th Percentile (Median):")
print(sample_df.median())
# Output:
# A     3.0
# B    30.0
# dtype: float64

# 75%: The 75th percentile (third quartile) value in each column.
print("\n75th Percentile:")
print(sample_df.quantile(0.75))
# Output:
# A     4.0
# B    40.0
# Name: 0.75, dtype: float64

# max: The maximum value in each column.
print("\nMaximum:")
print(sample_df.max())
# Output:
# A     5
# B    50
# dtype: int64


Count:
A    5
B    5
dtype: int64

Mean:
A     3.0
B    30.0
dtype: float64

Standard Deviation:
A     1.581139
B    15.811388
dtype: float64

Minimum:
A     1
B    10
dtype: int64

25th Percentile:
A     2.0
B    20.0
Name: 0.25, dtype: float64

50th Percentile (Median):
A     3.0
B    30.0
dtype: float64

75th Percentile:
A     4.0
B    40.0
Name: 0.75, dtype: float64

Maximum:
A     5
B    50
dtype: int64
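
The std value of 1.581139 for column A may look surprising, because pandas computes the sample standard deviation (dividing by n - 1) rather than the population version (dividing by n). The sketch below recomputes it by hand for the same values 1 to 5 as a check; the variable names are illustrative only.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

sample_std = values.std()              # pandas default: ddof=1, divides by n - 1
population_std = values.std(ddof=0)    # divides by n instead

# Same sample standard deviation computed step by step
squared_deviations = (values - values.mean()) ** 2
by_hand = (squared_deviations.sum() / (len(values) - 1)) ** 0.5

print(sample_std, population_std, by_hand)
```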


In [23]:
# 7. Missing data can be considered "across rows" or "down columns". Consider how df.dropna() or del df['col'] should be applied to most efficiently use the available non-missing data in your dataset and briefly answer the following questions in your own words

# Provide an example of a "use case" in which using df.dropna() might be preferred over using del df['col']
# df.dropna() is preferable when most of the data is present and only a few rows have scattered null values.
# In a large, mostly complete dataset, dropping those few rows loses as little data as possible.

# Provide an example of "the opposite use case" in which using del df['col'] might be preferred over using df.dropna()
# If a column is missing the majority of its values, it is better to sacrifice that one variable than to drop most of the rows.

# Discuss why applying del df['col'] before df.dropna() when both are used together could be important
# Both are needed when some columns are mostly null and other rows have scattered missing values.
# Deleting the mostly-null columns first means df.dropna() only removes rows that are missing values in the remaining columns, so far fewer rows are lost.

# Remove all missing data from one of the datasets you're considering using some combination of del df['col'] and/or df.dropna() and give a justification for your approach, including a "before and after" report of the results of your approach for your dataset.

null_counts = new_df.isnull().sum()
# print(null_counts)

# 177 missing age entries
# 688 missing deck entries
# Drop both columns first (errors="ignore" keeps the cell safe to re-run)
new_df = new_df.drop(columns=["age", "deck"], errors="ignore")

# Then drop any remaining rows that still contain null values
cleaned_new_df = new_df.dropna()
print(cleaned_new_df.describe())
print(cleaned_new_df.isnull().sum())

         survived      pclass       sibsp       parch        fare
count  889.000000  889.000000  889.000000  889.000000  889.000000
mean     0.382452    2.311586    0.524184    0.382452   32.096681
std      0.486260    0.834700    1.103705    0.806761   49.697504
min      0.000000    1.000000    0.000000    0.000000    0.000000
25%      0.000000    2.000000    0.000000    0.000000    7.895800
50%      0.000000    3.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000    8.000000    6.000000  512.329200
survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
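
The effect of ordering can be seen on a small scale. The sketch below uses a hypothetical five-row frame (values invented, loosely modelled on the Titanic columns) in which "deck" is mostly missing, and compares how many rows survive with dropna() alone versus dropping the column first.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "deck" is mostly missing, "age" is missing once
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "age": [22.0, 38.0, np.nan, 35.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
})

rows_before = len(toy)
rows_dropna_only = len(toy.dropna())                           # every row with any null is lost
rows_column_first = len(toy.drop(columns=["deck"]).dropna())   # only rows missing "age" are lost

print("Before:", rows_before)
print("dropna() only:", rows_dropna_only)
print("Drop 'deck', then dropna():", rows_column_first)
```

Dropping the sparse column first keeps four of the five rows here, while dropna() alone keeps one.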


In [34]:
# 8. Give brief explanations in your own words for any requested answers to the questions below

# df.groupby("col1")["col2"].describe() splits the rows into groups, one per unique value of col1,
# selects col2 within each group, and then computes the usual summary statistics for each group.
# Here it takes every unique value of 'sex' and computes separate 'fare' statistics for each one.
# In this case, we can compare the count, mean, std, min, quartiles, and max of fares for men and women.
print(cleaned_new_df.groupby("sex")["fare"].describe())

# The count values from df.describe() and df.groupby("col1")["col2"].describe() capture different aspects of the data.
# df.describe() shows, for each column, the number of non-null values across the whole dataset.
# df.groupby("col1")["col2"].describe() shows the number of non-null col2 values within each col1 group, so missing data can be traced to specific groups; rows where col1 itself is missing are left out of every group by default.

# Intentionally introduce the following errors into your code and report your opinion as to whether it's easier to (a) work in a ChatBot session to fix the errors, or (b) use google to search for and fix errors: first share the errors you get in the ChatBot session and see if you can work with ChatBot to troubleshoot and fix the coding errors, and then see if you think a google search for the error provides the necessary troubleshooting help more quickly than ChatGPT
# Since I've programmed before and know what to look for, personally I find it easier to just google errors, since I know mostly what I am looking for.
# I find it nicer to Google things because I can see perspectives from different professional opinions whereas AI will give me one 'factual' answer from one standpoint
# For a beginner, I see why having a streamlined process for debugging is extremely useful to have. AI is certainly a viable option for many people out there.

# 9. Have you reviewed the course wiki-textbook and interacted with a ChatBot (or, if that wasn't sufficient, real people in the course piazza discussion board or TA office hours) to help you understand all the material in the tutorial and lecture that you didn't quite follow when you first saw it?
# No

        count       mean        std   min        25%   50%     75%       max
sex                                                                         
female  312.0  44.252124  58.113672  6.75  11.810425  23.0  53.575  512.3292
male    577.0  25.523893  43.138263  0.00   7.895800  10.5  26.550  512.3292
