**1. Pick one of the datasets from the ChatBot session(s) of the TUT demo (or from your own ChatBot session if you wish) and use the code produced through the ChatBot interactions to import the data and confirm that the dataset has missing values.**

In [8]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv")
df.isna().sum()

row_n           0
id              1
name            0
gender          0
species         0
birthday        0
personality     0
song           11
phrase          0
full_id         0
url             0
dtype: int64

Therefore, this dataset does have missing values: 11 in the song column and 1 in the id column.

**2. Start a new ChatBot session with an initial prompt introducing the dataset you're using and request help to determine how many columns and rows of data a pandas DataFrame has, and then**

1. use code provided in your ChatBot session to print out the number of rows and columns of the dataset; and,

In [9]:
df.shape

(391, 11)

So there are 391 rows and 11 columns.

2. write your own general definitions of the meaning of "observations" and "variables" based on asking the ChatBot to explain these terms in the context of your dataset

Observations are the rows of the dataset. In the villagers dataset I am using, each observation is one specific villager, since that is what each row describes. Examples are Admiral in row 0, Agent S in row 1, etc.

Variables are the columns of the dataset. In the villagers dataset, each variable represents a characteristic of an observation. For example, the gender variable gives the gender of each observation. So for Admiral, based on the dataset, it would be male.
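
To make the row/column distinction concrete, here is a minimal sketch on a tiny hand-built DataFrame. The three rows are copied from the villagers output in this notebook; the name `toy` is only for illustration.

```python
import pandas as pd

# A tiny stand-in for the villagers data (three rows typed in by hand)
toy = pd.DataFrame({
    "name": ["Admiral", "Agent S", "Agnes"],
    "gender": ["male", "female", "female"],
    "species": ["bird", "squirrel", "pig"],
})

# df.shape is (rows, columns), i.e. (observations, variables)
n_observations, n_variables = toy.shape
print(n_observations, n_variables)

# One observation (a row): everything recorded about the first villager
print(toy.iloc[0])

# One variable (a column): the gender of every villager
print(toy["gender"])
```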

**3. Ask the ChatBot how you can provide simple summaries of the columns in the dataset and use the suggested code to provide these summaries for your dataset.**

In [11]:
df.describe()

       row_n
count  391.0
mean   239.902813
std    140.702672
min    2.0
25%    117.5
50%    240.0
75%    363.5
max    483.0


In [13]:
df['gender'].value_counts()

gender
male      204
female    187
Name: count, dtype: int64

In [14]:
df['name'].value_counts()

name
Admiral    1
Muffy      1
Paula      1
Patty      1
Pate       1
          ..
Elvis      1
Eloise     1
Elmer      1
Ellie      1
Zucker     1
Name: count, Length: 391, dtype: int64

In [15]:
df['species'].value_counts()

species
cat          23
rabbit       20
frog         18
squirrel     18
duck         17
dog          16
cub          16
pig          15
bear         15
mouse        15
horse        15
bird         13
penguin      13
sheep        13
elephant     11
wolf         11
ostrich      10
deer         10
eagle         9
gorilla       9
chicken       9
koala         9
goat          8
hamster       8
kangaroo      8
monkey        8
anteater      7
hippo         7
tiger         7
alligator     7
lion          7
bull          6
rhino         6
cow           4
octopus       3
Name: count, dtype: int64

**4. If the dataset you're using has (a) non-numeric variables and (b) missing values in numeric variables, explain (perhaps using help from a ChatBot if needed) the discrepancies between size of the dataset given by df.shape and what is reported by df.describe() with respect to (a) the number of columns it analyzes and (b) the values it reports in the "count" column.**

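(a) By default, the .describe() method only summarizes numeric columns, so non-numeric variables are left out of its output. For the villagers dataset, df.shape reports 11 columns, but df.describe() shows only row_n because it is the only numeric column.

(b) The "count" entry reports the number of non-missing values in each column that is analyzed, so for a numeric column with missing values the count is smaller than the number of rows given by df.shape. Here row_n has no missing values, so its count is 391, which matches the number of rows.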
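
Both discrepancies can be seen in a small sketch. The data below is a hypothetical toy example (not the villagers dataset), with one text column and one numeric column that has a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one text column, two numeric columns, one missing value
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "height": [1.0, 2.0, np.nan, 4.0],
    "weight": [10, 20, 30, 40],
})

summary = toy.describe()

print(toy.shape)               # all rows and all columns
print(list(summary.columns))   # only the numeric columns are summarized
print(summary.loc["count"])    # non-missing values per numeric column
```

If the text column should be summarized too, `toy.describe(include="all")` is one option.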



**5. Use your ChatBot session to help understand the difference between the following and then provide your own paraphrasing summarization of that difference.**

⋅an "attribute", such as df.shape which does not end with ()

⋅and a "method", such as df.describe() which does end with ()

An attribute would just give some information or data about the dataset. For example, df.shape gives you information on the number of rows and columns of the dataset.

A method is a function that is associated with the dataset. It can perform calculations on the data, modify the data, or return a new object. For example, df.describe() computes summary statistics for the columns in the data. Sometimes you can put arguments inside the parentheses to give more instructions.
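
As a rough illustration of the difference, the sketch below uses a made-up DataFrame called `toy`. The `percentiles` argument is just one example of an instruction passed inside the parentheses.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})

# Attribute: no parentheses, it just looks up stored information
print(toy.shape)

# Method: the parentheses call a function that does some work
summary = toy.describe()

# Arguments inside the parentheses change what the method does
custom = toy.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())

# Without parentheses a method is not run; you only get the function itself
print(callable(toy.describe), callable(toy.shape))
```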

**6. The df.describe() method provides the 'count', 'mean', 'std', 'min', '25%', '50%', '75%', and 'max' summary statistics for each variable it analyzes. Give the definitions (perhaps using help from the ChatBot if needed) of each of these summary statistics**

Count - The number of non-null values.

Mean - The average value of the non-null data.

Std - The standard deviation, a measure of how spread out the non-null values are around the mean.

Min - The minimum value.

25% - The 25th percentile (first quartile).

50% - The median (50th percentile).

75% - The 75th percentile (third quartile).

Max - The maximum value.
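
One way to check these definitions is to compute each statistic separately and compare it with what describe() reports. The series `s` below is made-up data containing one missing value.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value, which every statistic ignores
s = pd.Series([2.0, 4.0, 4.0, 6.0, np.nan, 9.0])
d = s.describe()

by_hand = {
    "count": s.count(),          # number of non-null values
    "mean": s.mean(),            # average of the non-null values
    "std": s.std(),              # sample standard deviation (divides by n - 1)
    "min": s.min(),
    "25%": s.quantile(0.25),     # first quartile
    "50%": s.median(),           # median
    "75%": s.quantile(0.75),     # third quartile
    "max": s.max(),
}

for name, value in by_hand.items():
    print(name, value, d[name])
```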

**7. Missing data can be considered "across rows" or "down columns". Consider how df.dropna() or del df['col'] should be applied to most efficiently use the available non-missing data in your dataset and briefly answer the following questions in your own words.**

1. Provide an example of a "use case" in which using df.dropna() might be preferred over using del df['col']

The df.dropna() method in pandas removes rows (or columns) that contain missing (null/NaN) values from a DataFrame.
The del df['col'] statement in Python deletes one specific column from a DataFrame. You give the column name and it is removed entirely, regardless of whether it contains null values or not.

Therefore, you would choose df.dropna() when you want to clean your data by removing incomplete rows (or columns) while retaining the useful information in the rest of the DataFrame. For example, in the villagers dataset only song (11 missing) and id (1 missing) have gaps, so df.dropna() removes at most 12 of the 391 rows and keeps every column.

2. Provide an example of "the opposite use case" in which using del df['col'] might be preferred over using df.dropna()

Using del df['col'] might be preferred over df.dropna() when you want to remove an entire column from the DataFrame, regardless of whether it contains null values or not, for example when the column is irrelevant to the analysis or is missing in most rows.

3. Discuss why applying del df['col'] before df.dropna() when both are used together could be important

The reason to apply del df['col'] before df.dropna() is to prevent unnecessary row deletions. For example, if you have irrelevant columns that contain missing values, applying df.dropna() first could delete rows that have those missing values. This would lead to the potential deletion of important data in those rows.
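
The effect of the ordering can be sketched with a hypothetical toy DataFrame in which `notes` plays the role of an irrelevant, mostly missing column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'notes' is an unneeded column that is mostly missing
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [1.0, 2.0, np.nan, 4.0],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# dropna() first: a row with a missing value in ANY column is removed
dropna_first = toy.dropna()

# del first: remove the unneeded column, then drop the remaining incomplete rows
trimmed = toy.copy()
del trimmed["notes"]
del_first = trimmed.dropna()

print(len(toy), len(dropna_first), len(del_first))
```

In this sketch, deleting the column first keeps the rows whose only gap was in `notes`.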

4. Remove all missing data from one of the datasets you're considering using some combination of del df['col'] and/or df.dropna() and give a justification for your approach, including a "before and after" report of the results of your approach for your dataset.

Before:

In [3]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv")
df

Unnamed: 0,row_n,id,name,gender,species,birthday,personality,song,phrase,full_id,url
0,2,admiral,Admiral,male,bird,1-27,cranky,Steep Hill,aye aye,villager-admiral,https://villagerdb.com/images/villagers/thumb/...
1,3,agent-s,Agent S,female,squirrel,7-2,peppy,DJ K.K.,sidekick,villager-agent-s,https://villagerdb.com/images/villagers/thumb/...
2,4,agnes,Agnes,female,pig,4-21,uchi,K.K. House,snuffle,villager-agnes,https://villagerdb.com/images/villagers/thumb/...
3,6,al,Al,male,gorilla,10-18,lazy,Steep Hill,Ayyeeee,villager-al,https://villagerdb.com/images/villagers/thumb/...
4,7,alfonso,Alfonso,male,alligator,6-9,lazy,Forest Life,it'sa me,villager-alfonso,https://villagerdb.com/images/villagers/thumb/...
...,...,...,...,...,...,...,...,...,...,...,...
386,475,winnie,Winnie,female,horse,1-31,peppy,My Place,hay-OK,villager-winnie,https://villagerdb.com/images/villagers/thumb/...
387,477,wolfgang,Wolfgang,male,wolf,11-25,cranky,K.K. Song,snarrrl,villager-wolfgang,https://villagerdb.com/images/villagers/thumb/...
388,480,yuka,Yuka,female,koala,7-20,snooty,Soulful K.K.,tsk tsk,villager-yuka,https://villagerdb.com/images/villagers/thumb/...
389,481,zell,Zell,male,deer,6-7,smug,K.K. D&B,pronk,villager-zell,https://villagerdb.com/images/villagers/thumb/...


After:

In [4]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv")
df.dropna()

Unnamed: 0,row_n,id,name,gender,species,birthday,personality,song,phrase,full_id,url
0,2,admiral,Admiral,male,bird,1-27,cranky,Steep Hill,aye aye,villager-admiral,https://villagerdb.com/images/villagers/thumb/...
1,3,agent-s,Agent S,female,squirrel,7-2,peppy,DJ K.K.,sidekick,villager-agent-s,https://villagerdb.com/images/villagers/thumb/...
2,4,agnes,Agnes,female,pig,4-21,uchi,K.K. House,snuffle,villager-agnes,https://villagerdb.com/images/villagers/thumb/...
3,6,al,Al,male,gorilla,10-18,lazy,Steep Hill,Ayyeeee,villager-al,https://villagerdb.com/images/villagers/thumb/...
4,7,alfonso,Alfonso,male,alligator,6-9,lazy,Forest Life,it'sa me,villager-alfonso,https://villagerdb.com/images/villagers/thumb/...
...,...,...,...,...,...,...,...,...,...,...,...
386,475,winnie,Winnie,female,horse,1-31,peppy,My Place,hay-OK,villager-winnie,https://villagerdb.com/images/villagers/thumb/...
387,477,wolfgang,Wolfgang,male,wolf,11-25,cranky,K.K. Song,snarrrl,villager-wolfgang,https://villagerdb.com/images/villagers/thumb/...
388,480,yuka,Yuka,female,koala,7-20,snooty,Soulful K.K.,tsk tsk,villager-yuka,https://villagerdb.com/images/villagers/thumb/...
389,481,zell,Zell,male,deer,6-7,smug,K.K. D&B,pronk,villager-zell,https://villagerdb.com/images/villagers/thumb/...


**8. Give brief explanations in your own words for any requested answers to the questions below.**

1. Use your ChatBot session to understand what df.groupby("col1")["col2"].describe() does and then demonstrate and explain this using a different example from the "titanic" data set other than what the ChatBot automatically provides for you

df.groupby("col1") splits your data into groups based on the values in column "col1". Then, ["col2"].describe() shows the summary statistics of "col2" separately for each of those groups.

E.g.

In [17]:
data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
data

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


After df.groupby("col1")["col2"].describe()

In [23]:
data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
data.groupby("species")["sepal_width"].describe()

species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,3.428,0.379064,2.3,3.2,3.4,3.675,4.4
versicolor,50.0,2.77,0.313798,2.0,2.525,2.8,3.0,3.4
virginica,50.0,2.974,0.322497,2.2,2.8,3.0,3.175,3.8


2. Assuming you've not yet removed missing values in the manner of question "7" above, df.describe() would have different values in the count value for different data columns depending on the missingness present in the original data. Why do these capture something fundamentally different from the values in the count that result from doing something like df.groupby("col1")["col2"].describe()?

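The count in df.describe() shows the number of non-missing (non-NaN) values in each column across the whole DataFrame, so it differs between columns according to how much is missing in each one. The count in df.groupby("col1")["col2"].describe() shows the number of non-missing (non-NaN) values in "col2" alone, counted separately within each group that "col1" defines, so it differs between groups of a single column rather than between columns.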
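
A small sketch of the two kinds of count, using made-up data in which the column names `group`, `value`, and `other` are placeholders:

```python
import numpy as np
import pandas as pd

# Made-up data: two numeric columns with different amounts of missingness
toy = pd.DataFrame({
    "group": ["x", "x", "x", "y", "y"],
    "value": [1.0, np.nan, 3.0, 4.0, 5.0],
    "other": [np.nan, np.nan, 1.0, 2.0, 3.0],
})

# One count per numeric column, over the whole DataFrame
overall = toy.describe().loc["count"]

# One count per group, for the single column 'value'
per_group = toy.groupby("group")["value"].describe()["count"]

print(overall)
print(per_group)
```

Here the per-group counts for `value` add up to the overall count for `value`, but they say nothing about `other`.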

3. Intentionally introduce the following errors into your code and report your opinion as to whether it's easier to (a) work in a ChatBot session to fix the errors, or (b) use google to search for and fix errors: first share the errors you get in the ChatBot session and see if you can work with ChatBot to troubleshoot and fix the coding errors, and then see if you think a google search for the error provides the necessary troubleshooting help more quickly than ChatGPT.

A. Forget to include import pandas as pd in your code

The ChatBot noticed the error and pointed it out while google search gave me GitHub for STA130.

B. Mistype "titanic.csv" as "titanics.csv"

The ChatBot just gave me data and google autocorrected the code to titanic.csv.

C. Try to use a dataframe before it's been assigned into the variable

The ChatBot would just assume that the url is the dataframe and run the code for me. Google would not tell me anything.

D. Forget one of the parentheses somewhere in the code

When I forget the parentheses in the code for the ChatBot, it would still just run the code, but when I say to fix it, it would tell me the error. Google would auto correct and add the parentheses.

E. Mistype one of the names of the chained functions with the code

The ChatBot would still run the code and Google would auto correct the code for me.

F. Use a column name that's not in your data for the groupby and column selection

Whether the ChatBot would run the code depended on the column name. If the wrong column name was close enough to the right one, the code would run, but if it was not, the ChatBot would give me suggestions. Google would not tell me anything.

G. Forget to put the column name as a string in quotes for the groupby and column selection, and see if the ChatBot and google are still as helpful as they were for the previous question

The ChatBot would still run the code assuming that the quotes exist. Google would tell me that the quotes are missing.
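
Independently of which tool is used to look them up, several of these mistakes raise a specific exception when the code is actually run in Python, and the exception name is usually the most useful thing to paste into a ChatBot or a search. The sketch below is illustrative only: `toy`, `not_yet_assigned`, and the misspelled names are hypothetical, and it assumes no local file called `titanics.csv` exists.

```python
import pandas as pd

toy = pd.DataFrame({"species": ["cat", "dog", "cat"], "age": [1, 2, 3]})
errors = {}

# B. Mistyped file name (assumes no local file with this name exists)
try:
    pd.read_csv("titanics.csv")
except FileNotFoundError as e:
    errors["B"] = type(e).__name__

# C. Using a name before anything has been assigned to it
try:
    not_yet_assigned.shape
except NameError as e:
    errors["C"] = type(e).__name__

# D. A missing closing parenthesis is caught before the code even runs
try:
    compile("toy.describe(", "<example>", "exec")
except SyntaxError as e:
    errors["D"] = type(e).__name__

# E. Mistyped name of a chained method
try:
    toy.groupby("species")["age"].describ()
except AttributeError as e:
    errors["E"] = type(e).__name__

# F. Column name that is not in the data
try:
    toy.groupby("specie")["age"].describe()
except KeyError as e:
    errors["F"] = type(e).__name__

# G. Missing quotes: Python looks for a variable called species
try:
    toy.groupby(species)["age"].describe()
except NameError as e:
    errors["G"] = type(e).__name__

print(errors)
```

Case A (a missing import) is not shown because the block needs pandas to run; it would surface as a NameError on `pd`, in the same way as case C.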

**9. Have you reviewed the course wiki-textbook and interacted with a ChatBot (or, if that wasn't sufficient, real people in the course piazza discussion board or TA office hours) to help you understand all the material in the tutorial and lecture that you didn't quite follow when you first saw it?**

Yes.

Question 1: https://chatgpt.com/share/66f28e19-1528-8001-9a60-9fa034fc74df

Questions 2, 3, 4, 5, 7: https://chatgpt.com/share/66f28f2a-c42c-8001-97dc-f4abb1521cb8

Questions 6, 8: https://chatgpt.com/share/66f28f8e-c3d4-8001-8a4b-af243b63982a