In [None]:
# 1

In [160]:
import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [None]:
# 2 

In [152]:
# Get the number of rows and columns
rows, columns = df.shape

f"The DataFrame has {rows} rows and {columns} columns."

'The DataFrame has 891 rows and 15 columns.'

In [None]:
"""
# Each observation corresponds to a single row in the dataset 
# and consists of the data values recorded for a number of 
# variables. Observations are the records gathered and 
# recorded for analysis. 

# Variables correspond to columns in the dataset and 
# represent the traits or qualities measured 
# or noted for each observation. Variables can be classified 
# into quantitative (numeric) or qualitative (non-numeric) 
# categories.
"""

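To make the row/column distinction concrete, here is a minimal sketch on a made-up three-passenger table. The names `toy`, `quantitative`, and `qualitative` are illustrative assumptions, not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy data: 3 observations (rows) of 3 variables (columns).
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],             # quantitative
    "fare": [7.25, 71.28, 7.92],           # quantitative
    "sex": ["male", "female", "female"],   # qualitative
})

n_observations, n_variables = toy.shape

# Split the variables by dtype: numeric columns vs. everything else.
quantitative = toy.select_dtypes(include="number").columns.tolist()
qualitative = toy.select_dtypes(exclude="number").columns.tolist()

print(n_observations, n_variables)
print(quantitative, qualitative)
```

The dtype is only a first approximation: a column such as pclass is stored as an integer but is really a category.
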
In [None]:
# 3

In [30]:
# Summary of numerical columns. (Notice that age has missing values, which is reflected in its lower count.)
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [32]:
# Return the count of each unique value in a non-numeric column.
df['embark_town'].value_counts()

embark_town
Southampton    644
Cherbourg      168
Queenstown      77
Name: count, dtype: int64

In [34]:
# Loop through all the columns and
# print the counts of each column's unique values.
for column in df.columns:
    print(f"Value counts for column '{column}':")
    print(df[column].value_counts())
    print("\n" + "="*50 + "\n")  # Separator for better readability

Value counts for column 'survived':
survived
0    549
1    342
Name: count, dtype: int64


Value counts for column 'pclass':
pclass
3    491
1    216
2    184
Name: count, dtype: int64


Value counts for column 'sex':
sex
male      577
female    314
Name: count, dtype: int64


Value counts for column 'age':
age
24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1
Name: count, Length: 88, dtype: int64


Value counts for column 'sibsp':
sibsp
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: count, dtype: int64


Value counts for column 'parch':
parch
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: count, dtype: int64


Value counts for column 'fare':
fare
8.0500     43
13.0000    42
7.8958     38
7.7500     34
26.0000    31
           ..
35.0000     1
28.5000     1
6.2375      1
14.0000     1
10.5167     1
Name: count, Length: 248, dtype: int64


Value counts for column

In [None]:
# 4

In [None]:
"""
df.shape returns the total number of rows and columns, 
regardless of whether the columns are numeric or non-numeric 
or contain missing values. It is a tuple with the format 
(number_of_rows, number_of_columns).

df.describe() provides summary statistics for numerical 
columns, ignoring missing values (NaN) and displaying the count 
of non-missing values.

As a result, the 'count' in df.describe() may differ from
the number of rows in df.shape, since missing values in a 
numeric column make its 'count' smaller than the total
row count.
"""

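The gap between the row count and the describe() count can be checked directly. This is a small sketch on made-up data (the `toy` frame and variable names are assumptions for illustration); it shows that the difference equals the number of missing values in the column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows, one missing age.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

total_rows = toy.shape[0]                        # counts every row
age_count = toy.describe().loc["count", "age"]   # counts non-missing ages only
missing_age = toy["age"].isnull().sum()

print(total_rows, age_count, missing_age)
```

The same arithmetic holds for the Titanic output earlier in this notebook: 891 rows minus a count of 714 gives the 177 missing ages.
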
In [None]:
# 5

In [None]:
"""
An "attribute", such as df.shape, which does not end with (), 
is a variable associated with an object or class. It 
represents the state or properties of that object and can be
read directly, without calling anything.

A method, such as df.describe(), which ends with (), is a block of 
code that performs actions or operations on an object. Methods 
are similar to functions, but they are associated with an object. 
When a method is called, the object it is invoked on is 
automatically passed as an argument to the method, and you can 
also include additional arguments within the parentheses.
"""

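The difference can be seen by asking Python whether each name is callable. The following is a minimal sketch on a made-up one-column frame (`toy` is an assumption for illustration).

```python
import pandas as pd

toy = pd.DataFrame({"fare": [7.25, 71.28, 7.92]})

# An attribute is read without parentheses and is not callable.
print(toy.shape)                 # a plain tuple
print(callable(toy.shape))       # False

# A method is a function bound to the object; it runs only when called.
print(callable(toy.describe))    # True
summary = toy.describe()

# Calling an attribute as if it were a method fails.
try:
    toy.shape()
except TypeError as err:
    error_message = str(err)
    print(error_message)
```
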
In [None]:
# 6

In [None]:
"""
Description of the statistics provided by df.describe().

count: The number of non-missing observations 
in the column.

mean: The sum of all the non-missing data values divided 
by the number of non-missing values.

std: The standard deviation, a measure of spread around 
the mean. The mean gives the center of the data; the 
variance is the average squared distance from the mean 
(pandas divides by n - 1), and the standard deviation is 
its square root. A higher standard deviation means more 
spread out values.

min: The smallest observed data value in the column. 

25%: Also known as the first quartile, it is the value 
below which 25% of the data points fall. 

50%: Also known as the median, it is the value in the 
middle of the data when the data values are sorted 
from smallest to largest. It divides the data into two 
equal halves, where 50% of the data points are below and
50% are above. 

75%: The third quartile, it is the value below which 
75% of the data points fall. It provides insight 
into the higher end of the data distribution. 

max: The largest observed data value in the column. 
"""

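Each statistic described above can be reproduced by hand and compared against describe(). This is a sketch on a made-up Series (`values` and `manual` are illustrative names), not the Titanic data.

```python
import pandas as pd

# Hypothetical data with one missing value.
values = pd.Series([2.0, 4.0, 4.0, 6.0, 9.0, None])

summary = values.describe()

manual = {
    "count": values.notnull().sum(),
    "mean": values.sum() / values.notnull().sum(),
    "std": values.std(),              # sample standard deviation (ddof=1)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),
    "75%": values.quantile(0.75),
    "max": values.max(),
}

for name, value in manual.items():
    print(f"{name}: describe={float(summary[name]):.3f} manual={float(value):.3f}")
```

Note that the missing value is skipped everywhere, so the count is 5 rather than 6.
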
In [None]:
# 7

In [60]:
"""
df.dropna() removes rows that contain at least one 
missing value. 

df.dropna(subset=['col1', 'col2']) removes rows 
with missing values in the specified columns.

del df['col'] deletes an entire column from the 
DataFrame. Use it when you determine a column is 
not useful for analysis because it has too many 
missing values or is irrelevant. 

#  1  

    Consider a dataset where each observation is a
review on a particular item and one column contains
ratings. If a few ratings are missing, using df.dropna()
to remove the rows with missing ratings rather than 
removing the whole column is preferred because the 
rating column is important to understanding customer
feedback. Retaining the column and focusing on complete
rows will provide a more accurate and meaningful 
analysis, even if it means working with less data.

#  2

    Consider a dataset for customer reviews with a column
for reviewer_age. If the reviewer_age column has a large 
number of missing values and it is not critical for understanding
customer feedback, using del df['reviewer_age'] might be 
preferred over df.dropna(). Removing this column simplifies
the dataset and focuses your analysis on other columns like
ratings.

#  3 

    It is important to remove a column with a substantial 
amount of missing values before using df.dropna() because
if not first removed, the method will remove a large 
number of rows containing valuable data. By deleting a 
problematic column first, you ensure that df.dropna()
only affects the remaining columns, thus preserving more
relevant data and improving the efficiency of the data 
cleaning process. 

#  4 

    To remove all missing data from a dataset appropriately, 
first use df.isnull().sum() to check all the columns for 
missing data. If a column has a substantial amount of 
missing data, remove that column using del df['col']. 
Then, decide whether you should use df.dropna() to remove 
the observations that contain missing data. However, before 
deleting a column, weigh the importance of that column
to your analysis. 

"""

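The effect of the order of operations can be seen on a small made-up table. In this sketch (`toy`, `trimmed`, and the row-count names are assumptions for illustration), a mostly empty column wipes out almost every row unless it is deleted first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'deck' is mostly missing, 'embarked' has one gap.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "embarked": ["S", "C", np.nan, "S", "Q"],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
})

# Option 1: drop rows straight away. The sparse column removes most rows.
rows_dropna_only = len(toy.dropna())

# Option 2: delete the sparse column first, then drop incomplete rows.
trimmed = toy.copy()
del trimmed["deck"]
rows_del_then_dropna = len(trimmed.dropna())

# Option 3: only require values in the columns that matter.
rows_subset = len(toy.dropna(subset=["embarked"]))

print(rows_dropna_only, rows_del_then_dropna, rows_subset)
```

Working on a copy keeps the original frame intact, which is the same choice made in the next cell.
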
In [162]:

df_copy = df.copy()

del df_copy['deck']

del df_copy['age']

df_copy.dropna(inplace=True)

print("Missing values count")
print(df_copy.isnull().sum())

rows_copy, columns_copy = df_copy.shape

f"The DataFrame has {rows_copy} rows and {columns_copy} columns."


Missing values count
survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64


'The DataFrame has 889 rows and 13 columns.'

In [156]:
# 8 

In [166]:
# Make sure we are working with the original data 
print("Missing values count")
print(df.isnull().sum())

rows, columns = df.shape

f"The DataFrame has {rows} rows and {columns} columns."

Missing values count
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


'The DataFrame has 891 rows and 15 columns.'

In [54]:
#  8. (1)
# Group the data by the unique values found in column 'who',
# which happen to be 'man', 'woman' and 'child'.
# Do summary analysis on each individual group. 

# Return grouped summary analysis of 'fare' based on the 
# unique groups of column 'who'.
df.groupby("who")["fare"].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
who,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
child,83.0,32.785795,33.466438,7.225,15.9,26.25,32.19375,211.3375
man,537.0,24.864182,44.021339,0.0,7.8542,9.5,26.3875,512.3292
woman,271.0,46.570711,60.318995,6.75,10.5,23.25,65.0,512.3292
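
The per-group counts partition the non-missing values of the column, so they add up to the overall count. Here is a minimal sketch on made-up data (`toy` and the variable names are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing fare in the 'man' group.
toy = pd.DataFrame({
    "who": ["man", "woman", "child", "man", "woman", "man"],
    "fare": [7.25, 71.28, 21.08, np.nan, 53.10, 8.05],
})

overall_count = toy["fare"].describe()["count"]
group_counts = toy.groupby("who")["fare"].describe()["count"]

print(group_counts)
print(overall_count, group_counts.sum())
```

A missing fare lowers the count only in the group it belongs to, which is why group counts can reveal where the gaps are concentrated.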


In [172]:
# 8. (2)

# Note that 83.0 + 537.0 + 271.0 = 891.0 

"""
The count returned by df.groupby("who")["fare"].describe()
is the number of non-missing "fare" values within 
each unique group in the column "who". Whereas the count 
returned by df.describe() is the number of 
non-missing values in each numeric column across the whole DataFrame.
"""

df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [25]:
#  8. (3)

import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

# a)
# The chatbot instantly told me to import pandas as pd.
# After copying and pasting the same error into Google,
# the first result I clicked was a Stack Overflow 
# page that told me to install seaborn. 

# b)
# The chatbot knew there was an issue with the path and
# provided the right path, possibly because of our 
# previous chat. The first Google result is a Stack Overflow
# page which mentions using a direct path, but it cannot 
# provide me the exact URL to the file. 

# c)
# ChatGPT tells me that DF has not been defined or assigned a
# DataFrame in my code and provides the code to fix it. 
# Google tells me to make sure my DataFrame is declared 
# before access. 

# d)
# The chatbot tells me that I am likely missing a closing parenthesis.
# Google told me to make sure the code lines up with the proper 
# indentation level. 

# e)
# ChatGPT instantly told me the code I needed to use. 
# The first Google result told me the correct function name.
# https://www.reddit.com/r/learnpython/comments/80rrvr/pandas_attributeerror_dataframe_object_has_no/
# Google search: AttributeError: 'DataFrame' object has no attribute 'group_by'

# f)
# The chatbot told me that the column doesn't exist, provided 
# me the code to check the columns in the DataFrame and 
# also provided me the code I was looking for. 
# It took a long time to find the problem using Google. 

# g) The chatbot mentioned there was an error in my code with the 
# column name fare and provided me the code to fix it. 
# Google didn't find the exact problem, but it found similar problems.
# Using Google would have taken me a lot longer to solve it. 
                 



In [29]:
# df.groupby("Who")[fare].describe()

KeyError: 'Who'
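
The failing line has two separate problems: the column label is capitalised as "Who", and fare is written without quotes. Python evaluates the groupby call first, so the KeyError appears before the unquoted name is ever looked up. Here is a sketch of the diagnosis and the fix on made-up data (`toy` and `fixed` are illustrative names).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic frame.
toy = pd.DataFrame({
    "who": ["man", "woman", "child", "man"],
    "fare": [7.25, 71.28, 21.08, 8.05],
})

# Column labels are case-sensitive, so "Who" is not "who".
try:
    toy.groupby("Who")["fare"].describe()
except KeyError as err:
    print("KeyError:", err)
    print("Available columns:", list(toy.columns))

# Corrected call: lowercase "who", and "fare" quoted as a string.
fixed = toy.groupby("who")["fare"].describe()
print(fixed["count"])
```

Printing the column list is usually the quickest way to spot a misspelled or miscapitalised label.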

In [None]:
# 9 
# Yes

## Chat Log History 1 - 7 

https://chatgpt.com/share/bf9a645f-9d61-49f0-879a-d06a324cd2b0

## Summary 

### 1

Summary of Exchanges

1. Understanding DataFrame Attributes and Methods:
   - Attributes vs. Methods:
     - Attributes: Characteristics of a DataFrame that provide direct information without performing actions. For example, df.shape is an attribute that returns the dimensions (number of rows and columns) of the DataFrame.
     - Methods: Functions linked to a DataFrame that perform specific operations. Methods require parentheses because they can accept arguments and execute actions or calculations. For instance, df.describe() is a method that generates summary statistics for numeric columns.
   - Parentheses in Methods:
     - The parentheses are used to pass additional arguments to the method, allowing it to customize its behavior. For example, df.describe(include='all') includes both numeric and non-numeric columns in the summary.

2. Dataset Size and Summary Differences:
   - df.shape:
     - Provides the overall dimensions of the DataFrame, reporting the total number of rows and columns, including all columns regardless of type.
   - df.describe():
     - By default, summarizes only numeric columns, providing statistics like count, mean, and standard deviation.
     - If include='all' is specified, it includes non-numeric columns and provides counts of unique values, but the 'count' row shows non-missing entries only for each column.

3. Difference Between Mathematical Functions and Programming Functions:
   - Mathematical Functions:
     - Define static relations between inputs and outputs (e.g., f(x) = x^2).
   - Programming Functions:
     - Blocks of code designed to perform tasks, potentially modifying objects or interacting with other functions. Methods in programming are a type of function linked to objects that perform operations on them.

4. Rephrased Explanation:
   - A method is a function associated with a DataFrame object that performs operations on that DataFrame. Parentheses are necessary to accept additional arguments and customize behavior. For instance, calling .describe() uses the DataFrame as the context, while the parentheses allow for optional parameters to adjust the output.
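
As an illustration of how an argument inside the parentheses changes a method's behavior, the sketch below compares describe() with describe(include='all') on a made-up frame (`toy`, `numeric_only`, and `everything` are assumed names).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one text column, each with a gap.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex": ["male", "female", "female", None],
})

numeric_only = toy.describe()              # default: numeric columns only
everything = toy.describe(include="all")   # adds unique/top/freq for text columns

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
print(everything.loc["count"])
```

In both summaries the count row reports non-missing entries only, so it can be lower than the number of rows in df.shape.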

### 2

Summary of Data Cleaning and Analysis Discussion:

We discussed how to effectively handle missing data in a dataset using methods like df.dropna() and del df['col'] in Python with pandas. The focus was on deciding when to remove rows versus when to remove entire columns based on the importance of the data and the extent of missing values.

- Handling Missing Data Across Rows vs. Down Columns:
  - df.dropna() removes rows or columns with missing values, which is useful when you want to preserve the structure of critical data while removing incomplete records.
  - del df['col'] deletes an entire column and is useful when a column has excessive missing values or is not crucial for the analysis.
- Use Case for df.dropna():
  - In a dataset of customer reviews, keeping the Sentiment_Score column intact while removing rows with missing scores ensures that valuable data for analysis is not lost, which is preferable over deleting the entire sentiment column.
- Use Case for del df['col']:
  - In cases where a column, like Reviewer_Age, has a significant amount of missing data and is not essential to the analysis, removing the entire column (del df['Reviewer_Age']) can streamline the dataset and focus on more relevant data.
- Order of Operations - del Before df.dropna():
  - It's important to remove columns with substantial missing values before applying df.dropna(). This prevents unnecessary loss of rows that are otherwise complete in critical columns, ensuring that the dataset retains as much valuable information as possible.
- Practical Application:
  - For an example dataset, we removed a column (Reviewer_Age) with a large number of missing values using del, followed by using df.dropna() to clean the remaining rows. The "before and after" report showed a reduction in missing data and an efficient cleanup of the dataset, preserving important information for analysis.

This conversation emphasized the importance of strategically handling missing data to maximize the quality and utility of a dataset in data analysis tasks.

## Chat Log History 8

https://chatgpt.com/share/e392948c-9b5f-4970-bd30-8a39d6744520

## Summary 

Summary of Interactions

1. Loading and Inspecting Data
   - Initial Issue: You encountered a NameError indicating that pd was not defined.
   - Resolution: The error was due to missing the import statement for pandas. Adding import pandas as pd resolved the issue.

2. Handling Missing Values
   - Task: You examined missing values in the Titanic dataset.
   - Code Used: df.isnull().sum() was used to check for missing values.
   - Explanation: This method returns the count of missing values for each column in the DataFrame.

3. Understanding Data Description
   - Question: You asked about the difference between df.describe() and df.groupby("col1")["col2"].describe().
   - Explanation: df.describe() provides summary statistics for each numeric column, excluding missing values. df.groupby("col1")["col2"].describe() provides the same statistics for col2 within each group defined by col1, again excluding the missing values in each group.

4. Fixing a FileNotFoundError
   - Issue: You encountered a FileNotFoundError because the file path was incorrect.
   - Resolution: Verified the file name and path, ensuring they matched the actual file location.

5. Correcting SyntaxError
   - Issue: You faced a SyntaxError due to incomplete input.
   - Resolution: Ensured all code statements were properly closed and complete, and verified proper syntax.

6. Fixing AttributeError
   - Issue: You received an AttributeError for using group_by instead of groupby.
   - Resolution: Replaced group_by with groupby to correctly group the DataFrame.

7. Fixing KeyError
   - Issue: You encountered a KeyError due to using an incorrect column name and incorrect syntax in df.groupby("Who")[fare].describe().
   - Resolution: Verified the correct column name and corrected the syntax to use "fare" and the actual column name.