In [None]:
'''Summary of the Conversation:

Dataset Overview: You asked about how to inspect the columns and size of a dataset using df.shape and df.describe(). We 
discussed that df.shape provides the dimensions (number of rows and columns), while df.describe() gives summary statistics 
like mean, min, max, and quartiles for numeric columns.

Attributes vs Methods: We clarified the difference between an attribute and a method in pandas:

Attribute: e.g., df.shape (no parentheses) gives information about the DataFrame without performing any action.
Method: e.g., df.describe() (with parentheses) performs an operation and returns a result, in this case, summary statistics.

Handling Missing Data: You wanted to know how to efficiently remove missing data:

First, identify columns with many missing values and delete them using del df['col'].
Then, use df.dropna() to remove rows with missing values, ensuring you retain useful data.
We discussed the importance of deleting irrelevant columns before removing rows to avoid over-deleting data.

Deleting Columns with Missing Data: I provided code for deleting specific columns with missing values:

You can use df.isnull().sum() to find out how many missing values each column has.
To remove columns with missing values using del df[], you can loop through those columns and delete them one by one.

Runnable Code Example: We created a runnable code snippet that first identifies columns with missing values and then deletes
those columns using del df[col]. This approach ensures that columns with missing values are efficiently removed without 
needing other functions like df.drop().
'''

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv"
df = pd.read_csv(url)
df.isna().sum()


row_n           0
id              1
name            0
gender          0
species         0
birthday        0
personality     0
song           11
phrase          0
full_id         0
url             0
dtype: int64

In [3]:
rows, columns = df.shape

# Print the result
print(f"The dataset has {rows} rows and {columns} columns.")

The dataset has 391 rows and 11 columns.


In [10]:

# Statistical summary of numerical columns
df.describe()

# Display the first few rows to understand the data better


            row_n
count  391.000000
mean   239.902813
std    140.702672
min      2.000000
25%    117.500000
50%    240.000000
75%    363.500000
max    483.000000


In [None]:
''' .shape gives the total number of rows and columns in the dataset, regardless of the type of data in those columns.
    .describe() by default summarizes only the numeric columns, which is why it reports just row_n here even though
.shape reports 11 columns.
    .shape: The number of rows it reports includes every row, whether or not the row contains missing values.
Even if a row has missing data in some numeric columns, it still counts toward the total number of rows.
    The count in .describe() refers only to the non-null values in each column, so it can be smaller than the row count.
'''
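
A minimal sketch of the point above, using a small made-up DataFrame (the name toy and its values are assumptions for
illustration, not part of the villagers data). It shows that .shape counts every row while the count from .describe()
skips the missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a missing value, one text column.
toy = pd.DataFrame({
    "score": [1.0, 2.0, np.nan, 4.0],
    "label": ["a", "b", "c", "d"],
})

n_rows, n_cols = toy.shape   # attribute: counts every row and every column
summary = toy.describe()     # method: numeric columns only, by default

print(n_rows, n_cols)                  # 4 2
print(list(summary.columns))           # ['score']
print(summary.loc["count", "score"])   # 3.0
```

The text column is counted by .shape but left out of the summary, and the row with the missing score is counted by
.shape but not by count.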


In [None]:
'''
Attribute:
Does not end with ()
Returns a property or data
Typically static information
Accessed like a variable

Method:
Ends with ()
Performs an action
May involve computation or changes
Called like a function   '''
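
As a rough illustration of the list above (toy is a made-up DataFrame, assumed only for this example), the same object
can be read through an attribute or asked to compute something through a method. Leaving the parentheses off a method
does not run it; it only hands back the method itself.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

dims = toy.shape         # attribute: no parentheses, returns a (rows, columns) tuple
stats = toy.describe()   # method: parentheses trigger the computation
unbound = toy.describe   # no parentheses: the method object, nothing is computed yet

print(dims)                    # (3, 2)
print(type(stats).__name__)    # DataFrame
print(callable(unbound))       # True
```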

In [None]:
'''
count: tells you how many valid (non-missing) entries exist for that particular column
mean: the average of the values, calculated by adding the values up and dividing by the number of values
std: the standard deviation, i.e. the square root of the variance. pandas uses the sample variance: the sum of the
     squared differences from the mean divided by (count - 1)
min: the smallest value
25%: the first quartile; 25% of the data fall below this value
50%: the median; the middle value when the data are arranged in order
75%: the third quartile; 25% of the data fall above this value
max: the largest value in the data set
'''
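
The definitions above can be checked by hand. This is only a sketch on a made-up Series (values is an assumed example,
not taken from the dataset): it recomputes count, mean, and std step by step and compares them with what .describe()
reports.

```python
import pandas as pd

# Hypothetical values; the None becomes a missing value (NaN).
values = pd.Series([2.0, 4.0, 4.0, 6.0, None])

valid = values.dropna()                    # describe() ignores missing entries
count = len(valid)
mean = valid.sum() / count
# Sample variance: squared differences from the mean, divided by (count - 1).
sample_var = ((valid - mean) ** 2).sum() / (count - 1)
std = sample_var ** 0.5

described = values.describe()

print(count, described["count"])           # 4 4.0
print(mean, described["mean"])             # 4.0 4.0
print(abs(std - described["std"]) < 1e-12) # True
print(described["50%"] == valid.median())  # True
```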


In [6]:
'''
1. Use df.dropna() when most of the missing data is concentrated in a few rows, not when it is scattered across many rows
   in different columns.
2. Use del df['col'] when most of the missing data is in a few columns; deleting those few columns removes most of the
   missing data.
3. Losing a column usually has a smaller effect, because a column represents only one characteristic of the data,
   whereas losing a row means losing an individual and can reduce the reliability of the data set (a decrease in n).
   Deleting the sparse columns first means df.dropna() has fewer rows to discard.
   
4. Identify the columns with a significant amount of missing data
   Remove the columns with many missing values
   Remove the rows that still have missing values
   
'''
import pandas as pd
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv"
df = pd.read_csv(url)

# before
missing_values = df.isnull().sum()
print(df.columns)
print("Columns with missing data:")
print(missing_values[missing_values > 0])

columns_with_missing_data = missing_values[missing_values > 0].index

for col in columns_with_missing_data:
    del df[col]
    
print("Columns after deletion:")
print(df.columns)

Index(['row_n', 'id', 'name', 'gender', 'species', 'birthday', 'personality',
       'song', 'phrase', 'full_id', 'url'],
      dtype='object')
Columns with missing data:
id       1
song    11
dtype: int64
Columns after deletion:
Index(['row_n', 'name', 'gender', 'species', 'birthday', 'personality',
       'phrase', 'full_id', 'url'],
      dtype='object')
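
The three steps listed in item 4 can be sketched as follows. This is an illustration under assumptions, not the code
that produced the output above: toy is a made-up DataFrame, and the 0.5 cutoff for "many missing values" is an
arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column that is mostly missing, one that is rarely missing.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "rarely_missing": [1.0, 2.0, np.nan, 4.0, 5.0],
})

threshold = 0.5  # assumed cutoff: delete columns with more than half their values missing

# Step 1: identify columns with a large share of missing values.
missing_share = toy.isna().mean()
cols_to_drop = missing_share[missing_share > threshold].index

# Step 2: delete those columns (on a copy, so toy stays available for comparison).
cleaned = toy.copy()
for col in cols_to_drop:
    del cleaned[col]

# Step 3: remove the rows that still contain missing values.
cleaned = cleaned.dropna()

print(list(cols_to_drop))   # ['mostly_missing']
print(cleaned.shape)        # (4, 2)
print(toy.dropna().shape)   # (1, 3)
```

Calling dropna() on the original toy data keeps only one row, while deleting the sparse column first keeps four, which
is the reason for deleting columns before rows.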


In [14]:
'''The first part, df.groupby("col1"), groups the DataFrame by each unique value in col1. It creates one subset per value,
each containing the rows that share that value of col1.
   The second part, ["col2"].describe(), summarizes the data in col2, but separately for each of the subsets created by
grouping on col1.
'''
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)
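# The column names must be quoted strings; writing groupby(survived) without quotes raises
# NameError: name 'survived' is not defined
age_by_survival = df.groupby('survived')["age"].describe()
print(age_by_survival)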

NameError: name 'survived' is not defined

In [None]:
'''
df.describe() gives a single count for each column, based on non-missing values across the entire dataset.
df.groupby("col1")["col2"].describe() gives a count for each unique group in col1, showing how many non-missing col2 values
are present within each group.

'''
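
A small sketch of the difference described above, on made-up data (toy, group, and value are assumed names used only
for illustration). The overall count and the per-group counts both skip missing values, and the per-group counts add up
to the overall one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups, each with one missing value.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, np.nan, 3.0, 4.0, np.nan],
})

overall = toy.describe()                             # one count for the whole column
per_group = toy.groupby("group")["value"].describe() # one count per group

print(overall.loc["count", "value"])   # 3.0
print(per_group.loc["a", "count"])     # 1.0
print(per_group.loc["b", "count"])     # 2.0
```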

In [12]:
'''
Intentionally introducing errors into the code in order to get a response

1. Forget to include import pandas as pd in your code
        ChatGPT provided a valid solution quickly and efficiently. It gave instructions on how to fix the code, and its
        answer was easier to understand.
        
2. Mistype 'titanic.csv'
        ChatGPT was still the more efficient tool in this case. Based on my previous conversation with it, it located my
        error quickly and provided a solution that I could use.
        
3. Try to use a DataFrame before it has been assigned to the variable
        Again, since ChatGPT can see my actual code, it can locate the problem and return an answer.
        Using Google search, however, was extremely hard because the error message is so vague.
        
4. Forget one of the parentheses somewhere in the code
        For this error, Google search gave me a clear and correct explanation of the error message. ChatGPT also responded,
        but instead of explaining the error, it tried to rewrite the code and added unnecessary steps.

5. Mistype one of the names of the chained functions in the code
        ChatGPT successfully identified my issue, while it took a long time to find it through Google search.
        
6. Use a column name that's not in your data for the groupby and column selection
        ChatGPT quickly identified the error and gave me an example of how to fix it, whereas it took a while for
        me to find a useful result through Google search.
        
7. Forget to put the column name as a string in quotes for the groupby and column selection
        ChatGPT correctly recognized the issue after I pasted in the entire error message. I struggled to find an answer
        with Google search, because the entire error message is too much to paste, but the last line alone is too short
        to return any useful information.
'''
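
For reference, items 5 to 7 each raise a different kind of exception, which is part of why the last line of the error
message matters. The sketch below reproduces them on a made-up DataFrame (toy is an assumption standing in for the
titanic data) and records only the exception type for each mistake.

```python
import pandas as pd

# Hypothetical stand-in for the titanic data, with the two columns used above.
toy = pd.DataFrame({"survived": [0, 1, 1], "age": [22.0, 38.0, 26.0]})

errors = {}

# Item 5: a mistyped method name in the chain.
try:
    toy.groupby("survived")["age"].describ()
except AttributeError as exc:
    errors["mistyped_method"] = type(exc).__name__

# Item 6: a quoted name that is not a column in the data.
try:
    toy.groupby("not_a_column")["age"].describe()
except KeyError as exc:
    errors["missing_column"] = type(exc).__name__

# Item 7: a column name without quotes is read as a variable name.
try:
    toy.groupby(survived)["age"].describe()  # deliberate mistake
except NameError as exc:
    errors["unquoted"] = type(exc).__name__

print(errors)
```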


In [4]:
'Somewhat'

'Somewhat'