In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv"
df = pd.read_csv(url)
df.isna().sum()

row_n           0
id              1
name            0
gender          0
species         0
birthday        0
personality     0
song           11
phrase          0
full_id         0
url             0
dtype: int64

In [3]:
# question 2
rows, columns = df.shape
print(f'The dataset has {rows} rows and {columns} columns.')

The dataset has 391 rows and 11 columns.


In [4]:
# question 3
# Get summary statistics for numerical columns
df.describe()

            row_n
count       391.0
mean   239.902813
std    140.702672
min           2.0
25%         117.5
50%         240.0
75%         363.5
max         483.0


In [5]:
# Get summary statistics for object columns
df.describe(include='object')

             id     name  gender  species  birthday  personality          song   phrase           full_id                                                url
count       390      391     391      391       391          391           380      391               391                                                391
unique      390      391       2       35       361            8            92      388               391                                                391
top     admiral  Admiral    male      cat      1-27         lazy  K.K. Country  wee one  villager-admiral  https://villagerdb.com/images/villagers/thumb/...
freq          1        1     204       23         2           60            10        2                 1                                                  1


In [6]:
# Get value counts for a specific column, e.g., 'gender'
df['gender'].value_counts()

gender
male      204
female    187
Name: count, dtype: int64

In [20]:
titanic_url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
titanic_df = pd.read_csv(titanic_url)
titanic_df.isna().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [8]:
titanic_df.count

<bound method DataFrame.count of      survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0           0       3    male  22.0      1      0   7.2500        S   Third   
1           1       1  female  38.0      1      0  71.2833        C   First   
2           1       3  female  26.0      0      0   7.9250        S   Third   
3           1       1  female  35.0      1      0  53.1000        S   First   
4           0       3    male  35.0      0      0   8.0500        S   Third   
..        ...     ...     ...   ...    ...    ...      ...      ...     ...   
886         0       2    male  27.0      0      0  13.0000        S  Second   
887         1       1  female  19.0      0      0  30.0000        S   First   
888         0       3  female   NaN      1      2  23.4500        S   Third   
889         1       1    male  26.0      0      0  30.0000        C   First   
890         0       3    male  32.0      0      0   7.7500        Q   Third   

       who  adult_

In [21]:
titanic_df.shape

(891, 15)

In [22]:
titanic_df.describe()

       survived    pclass        age     sibsp     parch       fare
count     891.0     891.0      714.0     891.0     891.0      891.0
mean   0.383838  2.308642  29.699118  0.523008  0.381594  32.204208
std    0.486592  0.836071  14.526497  1.102743  0.806057  49.693429
min         0.0       1.0       0.42       0.0       0.0        0.0
25%         0.0       2.0     20.125       0.0       0.0     7.9104
50%         0.0       3.0       28.0       0.0       0.0    14.4542
75%         1.0       3.0       38.0       1.0       0.0       31.0
max         1.0       3.0       80.0       8.0       6.0   512.3292


df.shape only gives the dimensions of the pandas DataFrame, as a (rows, columns) tuple, while
df.describe() is a method that returns summary statistics for the numerical columns of the DataFrame.
df.shape is an attribute read from the DataFrame's existing row and column labels, so it needs no computation over the data, whereas
df.describe() recalculates its statistics from the data every time you call it.
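
A minimal sketch of that difference, using a small made-up DataFrame (the `toy` frame and its values are hypothetical, not taken from either dataset). It shows that `shape` is read without parentheses and yields a tuple, while `describe()` is a call that builds a new DataFrame of statistics each time:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"fare": [7.25, 71.28, 8.05], "sex": ["male", "female", "male"]})

shape = toy.shape          # attribute: no parentheses, a (rows, columns) tuple
summary = toy.describe()   # method call: statistics for the numeric columns only

print(shape)
print(type(summary).__name__)
print(list(summary.columns))
```

By default only the numeric column appears in the summary; the text column is skipped unless `include` is set.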

count - the number of non-missing values in the column
mean - the average of the values in the column
std - the sample standard deviation of the values in the column
min - the smallest value in the column
25% - Q1, the value below which 25% of the values in the column fall
50% - the median, the value below which 50% of the values in the column fall
75% - Q3, the value below which 75% of the values in the column fall
max - the largest value in the column
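
As a rough check of these definitions, each field can be recomputed by hand and compared with what `describe()` reports. The `ages` values below are hypothetical and chosen only to include one missing entry:

```python
import pandas as pd

# Hypothetical values with one missing entry, used only for illustration.
ages = pd.Series([22.0, 38.0, 26.0, None, 35.0], name="age")

manual = pd.Series({
    "count": ages.count(),          # non-missing values only
    "mean": ages.mean(),
    "std": ages.std(),              # sample standard deviation (ddof=1)
    "min": ages.min(),
    "25%": ages.quantile(0.25),
    "50%": ages.median(),
    "75%": ages.quantile(0.75),
    "max": ages.max(),
})

described = ages.describe()

# Side-by-side comparison of the hand-computed and built-in values.
print(pd.DataFrame({"manual": manual, "describe": described}))
```

Every statistic skips the missing entry, which is why the count is 4 rather than 5.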

df.describe() gives different output if you pass 'object' to its 'include' parameter, as in df.describe(include='object').
The fields returned in this case are:
count - the number of non-missing values in the column
unique - the number of distinct values
top - the most frequently occurring value
freq - the number of times the most frequent value occurs in the column
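
The same kind of check works for text columns. This sketch rebuilds the four fields with `count()`, `nunique()` and `value_counts()`; the `species` values are hypothetical:

```python
import pandas as pd

# Hypothetical values with one missing entry, used only for illustration.
species = pd.Series(["cat", "dog", "cat", None, "frog", "cat"], name="species")

counts = species.value_counts()     # frequencies, most common first; missing values excluded

manual = {
    "count": species.count(),       # non-missing values
    "unique": species.nunique(),    # distinct values
    "top": counts.index[0],         # most frequent value
    "freq": counts.iloc[0],         # how often the most frequent value occurs
}

described = species.describe()
print(manual)
print(described)
```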

del df['col'] removes an entire column from the DataFrame, in place
df.dropna() returns a new DataFrame without the rows that contain missing values; the original DataFrame is left unchanged
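
A short sketch of the two operations on a made-up frame (the `toy` data is hypothetical). It also shows that `del` changes the DataFrame itself, while `dropna()` hands back a new one:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "age": [22.0, None, 35.0],
    "fare": [7.25, 71.28, 8.05],
})

del toy["name"]            # modifies toy in place: the column is gone
cleaned = toy.dropna()     # returns a new DataFrame; toy keeps all its rows

print(list(toy.columns))
print(len(toy), len(cleaned))
```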

del df['col'] should be used to remove attributes that are irrelevant to the analysis.
For example, you could remove the 'name' column from the Animal Crossing dataset: all 391 names are unique, so the column most likely has no effect on the statistics.
df.dropna() should be used to remove incomplete records from the dataset so that they do not distort the statistics.
For example, you could use df.dropna() on the Titanic dataset to drop passengers with missing values, such as the 2 rows without an 'embarked' value.

It is a good idea to delete all the irrelevant columns with del df['col'] before calling df.dropna(),
since a row usually should not count as incomplete if the only values it is missing are irrelevant to the study.
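
The effect of the ordering can be sketched on a small made-up frame (the `toy` data is hypothetical; "deck" stands in for a mostly empty, irrelevant column). The last step uses the `subset` parameter of `dropna()` as one possible alternative that keeps the column but ignores its gaps:

```python
import pandas as pd

# Hypothetical toy data: "deck" is mostly missing, "fare" is complete.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 8.05, 53.10],
    "deck": [None, "C", None, None],
    "embarked": ["S", "C", None, "S"],
})

# Dropping rows first discards every row with any gap, including gaps in "deck".
rows_first = toy.dropna()

# Deleting the irrelevant column first keeps rows that were only missing "deck".
trimmed = toy.copy()
del trimmed["deck"]
columns_first = trimmed.dropna()

# Alternative: keep every column but only check the ones that matter.
subset_only = toy.dropna(subset=["fare", "embarked"])

print(len(rows_first), len(columns_first), len(subset_only))
```

Only one row survives when rows are dropped first, while three survive with either of the other two approaches.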

In [28]:
titanic_df.dropna().describe()

       survived    pclass        age     sibsp     parch       fare
count     182.0     182.0      182.0     182.0     182.0      182.0
mean   0.675824  1.192308  35.623187  0.467033  0.478022  78.919735
std    0.469357  0.516411  15.671615  0.645007  0.755869  76.490774
min         0.0       1.0       0.92       0.0       0.0        0.0
25%         0.0       1.0       24.0       0.0       0.0       29.7
50%         1.0       1.0       36.0       0.0       0.0       57.0
75%         1.0       1.0      47.75       1.0       1.0       90.0
max         1.0       3.0       80.0       3.0       4.0   512.3292


In [30]:
del titanic_df['age']
del titanic_df['deck']
titanic_df.dropna().describe()

       survived    pclass     sibsp     parch       fare
count     889.0     889.0     889.0     889.0      889.0
mean   0.382452  2.311586  0.524184  0.382452  32.096681
std     0.48626    0.8347  1.103705  0.806761  49.697504
min         0.0       1.0       0.0       0.0        0.0
25%         0.0       2.0       0.0       0.0     7.8958
50%         0.0       3.0       0.0       0.0    14.4542
75%         1.0       3.0       1.0       0.0       31.0
max         1.0       3.0       8.0       6.0   512.3292


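In the example above, deleting the two columns with the most missing values, age and deck, before calling dropna() keeps 889 of the 891 rows instead of only 182. As a result, the median fare is far more accurate: it is 14.4542, the same as in the full dataset, rather than the 57.0 obtained from the 182 rows that survive dropna() alone.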

In [31]:
titanic_df.groupby('survived')['sex'].describe()

          count  unique     top  freq
survived
0           549       2    male   468
1           342       2  female   233


The number of rows in the DataFrame can differ from the 'count' field returned by .describe(), depending on which column we describe. The row count (df.shape[0] or len(df)) includes every row, while df.describe() reports the number of non-missing entries in each column as its 'count' value. Note that df.count is a method, not an attribute: written without parentheses, as in cell In [8], it only displays the bound method, while df.count() returns the same per-column non-missing counts that .describe() reports.
For example, in the Titanic DataFrame the ages of 177 passengers are missing. titanic_df.shape reports all 891 passengers, while titanic_df['age'].describe() reports a count of 714, the number of ages actually present in the column, which is less than the number of passengers overall.
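
A small sketch of the quantities involved, on a made-up frame (the `toy` data is hypothetical): `len(df)` is the number of rows, `df.count()` with parentheses returns the non-missing count per column, and the 'count' entry of `describe()` agrees with `df.count()`.

```python
import pandas as pd

# Hypothetical toy data with one missing age, used only for illustration.
toy = pd.DataFrame({"age": [22.0, None, 35.0], "fare": [7.25, 71.28, 8.05]})

n_rows = len(toy)                            # total rows, missing values included
non_missing = toy.count()                    # method call: non-missing values per column
described = toy["age"].describe()["count"]   # same figure as non_missing["age"]

print(callable(toy.count))   # without parentheses, count is just the method object
print(n_rows, non_missing["age"], described)
```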

In my experience, it is far more reliable to first try to identify the error by reading the error message, then to search for it online, and only then to paste it into a chatbot with appropriate context. There is one exception to this rule: if the stack trace contains none of your own code and only framework method calls, you are better off pasting it into a chatbot first, and searching for the error online only if the generated solution doesn't work.