# About
This notebook builds a data dictionary from any pandas DataFrame. It is a work in progress: some of the columns listed below are not implemented yet.

# Summary

Columns in the deliverable output:


1. Colname
2. Short name (easy way to understand colname) *
3. Type
4. Null and empty count
5. Outliers (numeric)
6. Length (non-numeric)
7. Dimension/measure/record **
8. Description *
9. Data example
10. Allowed values **
11. List of possible values
12. Team source ***
13. Source ***
14. Size (memory)

\* Needs human input <br>
\*\* Python helps, but needs human input <br>
\*\*\* Company usage <br>

**Ideas on how to create a data dictionary**

1. https://blog.panoply.io/how-to-create-a-data-dictionary 
2. https://peter-easter-do.medium.com/creating-a-data-dictionary-with-python-cccb212e44dc


# Packages

In [1]:
import pandas as pd
import pathlib
from zipfile import ZipFile
import calendar
import re

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.5f}'.format)

# Functions

In [None]:
def create_data_dictionary(data: pd.DataFrame) -> pd.DataFrame:
    """
    Create a data dictionary summarizing various aspects of the dataset.

    Parameters:
    - data (pd.DataFrame): The input dataset.

    Returns:
    - pd.DataFrame: One row per column of `data`, with its data type, null/empty
      count and share, observed string lengths, example values, and unique count.
    """
    # One row per column: column name and dtype
    data_dict = data.dtypes.reset_index()
    data_dict = data_dict.rename(columns={'index': 'Feature', 0: 'Data Type'})

    # Rows that are missing (NaN/None) or hold an empty string
    data_dict['null_count'] = [
        int((data[feature].isnull() | (data[feature] == '')).sum())
        for feature in data_dict['Feature']
    ]
    data_dict['null_count_perc'] = data_dict['null_count'] / data.shape[0]

    len_features = []
    data_example = []
    nunique_values = []
    for col in data:
        if data[col].dtypes == "O":
            # Distinct string lengths, most frequent first
            len_features.append(data[col].str.len().value_counts().index.tolist())
        else:
            len_features.append(None)
        # First three distinct values as examples
        data_example.append(data[col].unique()[:3].tolist())
        # nunique() ignores missing values
        nunique_values.append(data[col].nunique())

    data_dict['len_features'] = len_features
    data_dict['data_example'] = data_example
    data_dict['unique_count'] = nunique_values
    return data_dict
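
The Summary also lists "Outliers (numeric)" and "Size (memory)", which the function above does not compute yet. A minimal sketch of how those two columns might be derived is shown here. The helper name `add_outliers_and_memory`, the output column names, and the 1.5 * IQR outlier rule are assumptions made for illustration, not something the notebook defines.

```python
import pandas as pd


def add_outliers_and_memory(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: IQR outlier count and memory size (bytes) per column."""
    # deep=True measures the real size of the Python strings in object columns
    memory = data.memory_usage(index=False, deep=True)
    rows = []
    for col in data.columns:
        series = data[col]
        outlier_count = None  # stays empty for non-numeric columns
        is_numeric = pd.api.types.is_numeric_dtype(series)
        is_bool = pd.api.types.is_bool_dtype(series)
        if is_numeric and not is_bool:
            q1, q3 = series.quantile(0.25), series.quantile(0.75)
            iqr = q3 - q1
            # Assumption: values beyond 1.5 * IQR from the quartiles are outliers
            mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
            outlier_count = int(mask.sum())
        rows.append({
            'Feature': col,
            'outlier_count': outlier_count,
            'memory_bytes': int(memory[col]),
        })
    return pd.DataFrame(rows)


# Toy data, not taken from the Airbnb file used in the Example section
toy = pd.DataFrame({
    'price': [10.0, 12.0, 11.0, 13.0, 500.0],
    'city': ['Rio', 'Rio', 'Niteroi', 'Rio', 'Rio'],
})
extra = add_outliers_and_memory(toy)
print(extra)
```

The result has a `Feature` column, so it could be merged into the data dictionary with `data_dict.merge(extra, on='Feature')`. Other outlier rules (z-score, percentile caps) may suit some columns better.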

# Example

In [3]:
data = pd.read_csv(r'../HASHTAG/data/julho2018.csv')
data.head(2)



Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,14063,https://www.airbnb.com/rooms/14063,20180713234102,2018-07-14,Living in a Postcard,"Besides the most iconic's view, our apartment ...",,"Besides the most iconic's view, our apartment ...",none,Best and favorite neighborhood of Rio. Perfect...,,Everything is there. METRO is 5 min walk. Dir...,,,strictly no smoking in the apartment ! We want...,,,https://a0.muscache.com/im/pictures/66421/ae9b...,,53598,https://www.airbnb.com/users/show/53598,Shalev,2009-11-12,FL,"Hello , my name is Shalev , I am an orchestra ...",,,,f,https://a0.muscache.com/im/users/53598/profile...,https://a0.muscache.com/im/users/53598/profile...,Botafogo,1.0,1.0,"['email', 'phone', 'reviews', 'jumio']",t,t,"Rio de Janeiro, RJ, Brazil",Botafogo,Botafogo,,Rio de Janeiro,RJ,22250-040,Rio De Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.94685,-43.18274,t,Apartment,Entire home/apt,4,1.0,0.0,2.0,Real Bed,"{TV,Internet,""Air conditioning"",Kitchen,Doorma...",,$151.00,$934.00,"$3,061.00","$1,162.00",$116.00,2,$39.00,60,365,5 months ago,t,28,58,88,363,2018-07-14,38,2010-01-03,2018-03-04,91.0,9.0,9.0,9.0,9.0,9.0,9.0,f,,,f,f,strict_14_with_grace_period,f,f,1,0.37
1,17878,https://www.airbnb.com/rooms/17878,20180713234102,2018-07-14,Very Nice 2Br - Copacabana - WiFi,Please note that special rates apply for New Y...,- large balcony which looks out on pedestrian ...,Please note that special rates apply for New Y...,none,This is the best spot in Rio. Everything happe...,,Excellent location. Close to all major public ...,The entire apartment is yours. It is a vacatio...,I will be available throughout your stay shoul...,Please leave the apartment in a clean fashion ...,,,https://a0.muscache.com/im/pictures/65320518/3...,,68997,https://www.airbnb.com/users/show/68997,Matthias,2010-01-08,"Rio de Janeiro, Rio de Janeiro, Brazil",I used to work as a journalist all around the ...,within an hour,100%,,f,https://a0.muscache.com/im/pictures/67b13cea-8...,https://a0.muscache.com/im/pictures/67b13cea-8...,Copacabana,2.0,2.0,"['email', 'phone', 'reviews']",t,f,"Rio de Janeiro, Rio de Janeiro, Brazil",Copacabana,Copacabana,,Rio de Janeiro,Rio de Janeiro,22020-050,Rio De Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.96592,-43.17896,t,Condominium,Entire home/apt,5,1.0,2.0,2.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",,$306.00,,,$0.00,$310.00,2,$58.00,4,90,3 days ago,t,22,23,48,299,2018-07-14,210,2010-07-15,2018-06-18,93.0,10.0,9.0,10.0,10.0,9.0,9.0,f,,,t,f,strict_14_with_grace_period,f,f,1,2.16


## Type

In [4]:
data_dict = data.dtypes

In [5]:
data_dict = data_dict.reset_index()
data_dict = data_dict.rename(columns = {'index': 'Feature', 0: 'Data Type'})
data_dict

Unnamed: 0,Feature,Data Type
0,id,int64
1,listing_url,object
2,scrape_id,int64
3,last_scraped,object
4,name,object
5,summary,object
6,space,object
7,description,object
8,experiences_offered,object
9,neighborhood_overview,object


In [13]:
data[(data['space'].isnull()) | (data['space'] == '')].shape[0]

16192

In [7]:
data_dict['null_count'] = [data[(data[feature].isnull()) | (data[feature] == '')].shape[0] for feature in data_dict['Feature']]
data_dict['null_count_perc'] = data_dict['null_count']/(data.shape[0])
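
The per-feature list comprehension above can probably be replaced by a single vectorized expression. The sketch below uses a small toy DataFrame (an assumption, not the Airbnb data) and should give the same counts: missing values plus empty strings per column.

```python
import pandas as pd

# Toy data: one empty string and one missing value in 'name', one missing in 'beds'
toy = pd.DataFrame({'name': ['Ana', '', None], 'beds': [1.0, None, 2.0]})

# isna() flags NaN/None; eq('') flags empty strings (always False for numeric columns)
null_count = toy.isna().sum() + toy.eq('').sum()
null_count_perc = null_count / len(toy)

print(null_count)  # name: 2, beds: 1
```

Both results are Series indexed by column name, so they can be aligned with the `Feature` column via `data_dict['Feature'].map(null_count)`.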

In [9]:
data['listing_url'].str.len().value_counts()

listing_url
37    26895
36    10725
35     1192
34       63
Name: count, dtype: int64

In [10]:
len_features = data['listing_url'].str.len().value_counts().reset_index()
data_example = data.iloc[0].reset_index()
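
A full list of observed lengths can get long for free-text columns. One possible alternative, sketched here on toy data, is to keep only the minimum and maximum length of each non-numeric column. The `length_summary` name and the (min, max) format are assumptions for illustration.

```python
import pandas as pd

# Toy data, not taken from the Airbnb file
toy = pd.DataFrame({'url': ['a.com/1', 'a.com/22', None], 'beds': [1, 2, 3]})

length_summary = {}
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        continue  # length is only reported for non-numeric columns
    # Drop missing values first so they are not measured as the text "nan"/"None"
    lengths = toy[col].dropna().astype(str).str.len()
    if lengths.empty:
        length_summary[col] = None  # column has no non-null values
    else:
        length_summary[col] = (int(lengths.min()), int(lengths.max()))

print(length_summary)  # {'url': (7, 8)}
```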

In [12]:
data_dict

Unnamed: 0,Feature,Data Type,null_count,null_count_perc
0,id,int64,0,0.0
1,listing_url,object,0,0.0
2,scrape_id,int64,0,0.0
3,last_scraped,object,0,0.0
4,name,object,74,0.0019
5,summary,object,913,0.02349
6,space,object,16192,0.41651
7,description,object,189,0.00486
8,experiences_offered,object,0,0.0
9,neighborhood_overview,object,19482,0.50114


1. Colname
2. Short name (easy way to understand colname)*
3. Type
4. Null and empty count
5. Length
6. Dimension/measure/record**
7. Description *
8. Data example
9. Allowed values **
10. List possible values
11. Team source ***
12. Source ***
13. Size (memory)
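
For the "List possible values" item, a minimal sketch is to enumerate the distinct values of low-cardinality columns only. The function name `list_possible_values` and the threshold of 10 are hypothetical. Deciding which of those values are actually *allowed* still needs human input, as noted in the Summary.

```python
import pandas as pd

MAX_LISTED_VALUES = 10  # hypothetical cutoff for "few enough values to list"


def list_possible_values(data: pd.DataFrame, max_values: int = MAX_LISTED_VALUES) -> dict:
    """Hypothetical helper: distinct non-null values for low-cardinality columns."""
    possible = {}
    for col in data.columns:
        values = data[col].dropna().drop_duplicates().tolist()
        if len(values) <= max_values:
            # key=str keeps the sort from failing on mixed-type columns
            possible[col] = sorted(values, key=str)
        else:
            possible[col] = None  # too many distinct values to enumerate
    return possible


# Toy data, not taken from the Airbnb file
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt'],
    'id': [1, 2, 3],
})
possible_values = list_possible_values(toy, max_values=2)
print(possible_values)
```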