# Mud card and piazza questions

## General questions

- **What material do you suggest us to refer to when we want to know the information about a package usage except for python documentation? For example, if I want to make some calculations inside the dataframe and I want to know if there are any functions or syntax I can use in pandas to help me save a lot of work.**
    - stackoverflow
    - google your problem e.g., 'pandas count unique elements in column'
    - look for hits either on the pandas website or stackoverflow to find your solution
    - like [this](https://stackoverflow.com/questions/45759966/counting-unique-values-in-a-column-in-pandas-dataframe-like-in-qlik/45760042) or [this](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html)
    - I've been coding in python for roughly 10 years and I have yet to come across a coding problem that was not already solved by someone on stackoverflow.
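
As a small illustration of the kind of answer such a search turns up, here is a sketch of counting unique values in a column. The dataframe and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical toy data; the column names are placeholders.
df = pd.DataFrame({"city": ["Boston", "Providence", "Boston", "Newport"],
                   "visits": [3, 1, 2, 1]})

n_cities = df["city"].nunique()          # number of distinct values in one column
city_counts = df["city"].value_counts()  # how often each distinct value occurs

print(n_cities)
print(city_counts)
```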
    

- **Stepping back from EDA itself a bit, I was wondering about the formulation of a machine learning question. For example, I know in econometrics you are looking to understand causality. I think for machine learning, it is not limited to causality. I was hoping to have a better understanding of examples of questions we are trying to answer.**
    - ML is mostly about predictive power and automation
    - If a task is repetitive and potentially labor-intensive for humans, an ML model might speed things up
    - Examples:
        - grading essays
        - predict if a patient is sick or not based on medical data
        - predict if a bank transaction is fraud or not


- **I am struggling with the use of the symbol " . " ; how can I identify when I should use the symbol in python?**
    - you use the dot when you want to
        - use a function from a package (e.g., `pd.read_csv()`, `plt.plot()`)
        - create an instance of a class, in other words an object (e.g., `pd.DataFrame()`)
        - use a method specific to an object (e.g., `df.head()`, `df.describe()`)
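
The three uses of the dot can be seen side by side in a short sketch. The data is made up, and `pd.to_numeric` stands in for `pd.read_csv` only because no file is needed to run it.

```python
import pandas as pd

# 1) package.function: call a function that lives inside the pandas package
numbers = pd.to_numeric(["1", "2", "3"])

# 2) package.Class: create an object (an instance of the DataFrame class)
df = pd.DataFrame({"x": numbers, "y": [10, 20, 30]})

# 3) object.method: call a method that belongs to that object
first_rows = df.head(2)

# The dot also reaches attributes (no parentheses), such as the shape
print(first_rows)
print(df.shape)
```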

- **I was wondering what the common ways of dealing with missing values in our data is. I don't mean filling in the missing values, but just generally how can we visualize how much is missing, how much can we allow to be missing in order to proceed with building the model, and maybe whether there are any fast handy tricks/common strategies to deal with data with a lot of missing values that doesn't involve filling them in.**
    - We cover simple techniques for dealing with missing values in week 4 and revisit the topic in November with three advanced techniques
    - PS3 has one exercise on manipulating missing values in a dataframe.

## Pandas questions

- **I'm still unsure about loading sql data into pandas df's. Should we load sql tables into local memory first and then load them as csv's?**
    - I was indeed a bit vague about this because it depends on the type of SQL database you use (MySQL, PostgreSQL, etc.)
    - you need to establish a connection to the SQL database using e.g., `sqlalchemy`
    - then you can submit standard SQL queries and save the output directly into a dataframe
    - there is no need to load the tables into local memory first; this approach is faster and more memory efficient
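
A minimal sketch of the connect-then-query workflow. To keep it runnable without a database server, it assumes an in-memory SQLite database from the standard library `sqlite3` module; with MySQL or PostgreSQL the connection would instead come from something like `sqlalchemy.create_engine(...)` with a database-specific URL. The table and column names are made up.

```python
import sqlite3
import pandas as pd

# Assumption: an in-memory SQLite database stands in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [(1, 25), (2, 41), (3, 67)])
conn.commit()

# Submit a standard SQL query and receive the result directly as a dataframe.
df_sql = pd.read_sql_query("SELECT id, age FROM people WHERE age > 30", conn)
conn.close()

print(df_sql)
```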

- **Can you explain further how to fix a character encoding problem when loading in a csv file.**
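    - `pd.read_csv([path_to_file], encoding='utf-8')`
    - `'utf-8'` is already the default, so if loading fails, replace it with the encoding the file was actually saved in (e.g., `'latin-1'`)
    - [here](https://docs.python.org/3/library/codecs.html#standard-encodings) is a list of encodings supported by python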
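
The sketch below reproduces a typical encoding error and its fix. It assumes a small made-up file saved in Latin-1; the file name and contents are placeholders.

```python
import os
import tempfile
import pandas as pd

# Write a small CSV in Latin-1 to a temporary file (hypothetical example data).
path = os.path.join(tempfile.mkdtemp(), "cities.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write("city,country\nMálaga,Spain\nZürich,Switzerland\n")

# Reading with the wrong encoding fails...
try:
    pd.read_csv(path, encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# ...while naming the encoding the file was saved in works.
df_enc = pd.read_csv(path, encoding="latin-1")

print(utf8_failed)
print(df_enc)
```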

- **Since .loc and .iloc are almost the same when working with a range index, are there some cases when one is preferred over the other?**
    - it's a matter of preference I think
    - I am used to working with numpy arrays so iloc is easier for me to follow
    - if you work with a large dataset and runtime is an issue, you should try and time both approaches and check if one is faster than the other.
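
One practical difference is worth knowing even with a range index: slices behave differently at the end point. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})  # default RangeIndex 0..4

by_label = df.loc[1:3]      # label-based: the end label 3 is included
by_position = df.iloc[1:3]  # position-based: the end position 3 is excluded

print(len(by_label), len(by_position))
```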

- **Why is it that when I run `print(df[(df['hours-per-week'] >= 60)&(df['education']=='Doctorate')].shape)` my outcome is (33,15) but the answer is 96? Am I formatting something wrong? I ran all the previous code from the lessons prior.**
    - 33 is the number of rows selected, and 15 is the number of columns
    - you could do .shape[0] to get back the number of rows, and .shape[1] will give you the number of columns
        - if the question is how many people fulfill a certain set of conditions, you need the number of rows because there is one row per person in the adult dataset.
    - the second issue is that you are using adult_test.csv but the instruction asks you to use adult_train.csv
    - be mindful of what dataset is loaded into which data frame
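
A short sketch of reading the row count off `.shape`. The dataframe is a made-up stand-in for the adult dataset with only the two columns used in the condition.

```python
import pandas as pd

# Hypothetical stand-in for the adult dataset: one row per person.
df = pd.DataFrame({
    "hours-per-week": [40, 60, 72, 65, 38],
    "education": ["Doctorate", "Doctorate", "Bachelors", "Doctorate", "Doctorate"],
})

mask = (df["hours-per-week"] >= 60) & (df["education"] == "Doctorate")
selected = df[mask]

print(selected.shape)     # (number of rows, number of columns)
print(selected.shape[0])  # number of rows, i.e. people meeting both conditions
```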

- **Is there a way to normalize the counts against the max in their respective categories instead of the sum of all the counts?**
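    - the normalized count matrix already works within each category: every row is divided by that row's own sum, not by the sum of all the counts
    - `count_matrix_norm = count_matrix.div(count_matrix.sum(axis=1),axis=0)`
    - to divide by each row's maximum instead, replace `.sum(axis=1)` with `.max(axis=1)`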
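
A sketch comparing normalization by the row sum with normalization by the row maximum. The count matrix is made up; only the `.div(..., axis=0)` pattern comes from the lecture code.

```python
import pandas as pd

# Hypothetical count matrix: rows are categories, columns are outcomes.
count_matrix = pd.DataFrame({"a": [2, 10], "b": [6, 30]}, index=["cat1", "cat2"])

# Divide each row by its own sum: every row then adds up to 1.
norm_by_sum = count_matrix.div(count_matrix.sum(axis=1), axis=0)

# Divide each row by its own maximum: the largest entry in every row becomes 1.
norm_by_max = count_matrix.div(count_matrix.max(axis=1), axis=0)

print(norm_by_sum)
print(norm_by_max)
```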

- **I was very confused on the third question about the size of df_merge, as I just was not sure where that data frame came from.**
    - It's just some dummy data in the lecture notes in the form of python dictionaries, which you need to convert into dataframes as part of the exercise
    - let's go through it together

- **Is df1.merge(df2, how = 'right', on = 'ID') equivalent to df2.merge(df1, how = 'left', on = 'ID') ?**
    - yes, but don't believe me, just try it.
    - the column orders might be different but the two solutions are equivalent
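
To try it, a sketch along these lines should do. The two dataframes are made-up dummy data, and `pd.testing.assert_frame_equal` raises an error if the results differ.

```python
import pandas as pd

# Hypothetical dummy data; only the shared 'ID' column matters here.
df1 = pd.DataFrame({"ID": [1, 2, 3], "name": ["Ann", "Bob", "Cat"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "score": [0.5, 0.7, 0.9]})

right = df1.merge(df2, how="right", on="ID")
left = df2.merge(df1, how="left", on="ID")

print(list(right.columns))  # df1's columns come first
print(list(left.columns))   # df2's columns come first

# After putting the columns in the same order, the two results match.
pd.testing.assert_frame_equal(right[left.columns], left)
```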

- **What is the real difference between a right and left merge, or are they essentially the same operation written in two ways? And is there a time to use one over the other?**
- **I'm still a bit confused how exactly the merge method works. How do I know whether to use a left or a right merge?**
- **Prior to doing a merge between dataframes, are there some criteria to consider as to which one is on the 'left' or the 'right'? Is it more important to just stay consistent once you choose?**
    - `df1.merge(df2,how='left',on='ID')` - the merged dataframe will contain all the IDs from df1, and only those
    - `df1.merge(df2,how='right',on='ID')` - the merged dataframe will contain all the IDs from df2, and only those
    - which one you should use depends on the question you are trying to answer
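
A sketch of how the choice decides which IDs survive, using made-up dataframes with partly overlapping IDs:

```python
import pandas as pd

# Hypothetical dummy data with partly overlapping IDs.
df1 = pd.DataFrame({"ID": [1, 2, 3], "name": ["Ann", "Bob", "Cat"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "score": [0.5, 0.7, 0.9]})

left_merge = df1.merge(df2, how="left", on="ID")    # keeps IDs 1, 2, 3 from df1
right_merge = df1.merge(df2, how="right", on="ID")  # keeps IDs 2, 3, 4 from df2

# IDs without a partner in the other dataframe get NaN in the added columns.
print(left_merge)
print(right_merge)
```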

- **It seems the 'merge' and 'join' functions in python are similar. Merge seems more versatile. When do you suggest using merge, and when join (if ever)?**
    - I am pretty method-agnostic generally.
    - As long as you answer the question correctly, you can use either, I have no preference.

## Plotting questions

- **One question I had was if it is a good idea to use Kernel Density Estimation for making heat-maps instead of histograms? I would imagine that it would produce a more smooth looking plot that would make the structure more visible.**
    - Be careful with KDE because it can produce smooth but unrealistic figures
    - E.g., salaries are smoothed into the negative range
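
The effect can be checked numerically. The sketch below is a one-dimensional illustration, not the 2D heat-map case: it hand-rolls a Gaussian kernel density estimate with numpy on made-up positive salaries, and the bandwidth is an assumption picked by hand.

```python
import numpy as np

# Hypothetical salaries: strictly positive, with many values close to zero.
rng = np.random.default_rng(0)
salaries = rng.exponential(scale=30_000, size=500)

def gaussian_kde_at(x, data, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `data` at the point x."""
    z = (x - data) / bandwidth
    return np.exp(-0.5 * z**2).sum() / (len(data) * bandwidth * np.sqrt(2 * np.pi))

bandwidth = 5_000  # assumed smoothing width, chosen by hand for the illustration
density_negative = gaussian_kde_at(-2_000, salaries, bandwidth)

print(salaries.min() > 0)    # are all observed salaries positive?
print(density_negative > 0)  # does the estimate put density on a negative salary?
```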

- **I had the most trouble with identifying the correct type of plot for different kinds of data (i.e. the last quiz before the last video)**
- **The different plot part is a bit confusing. It takes some efforts to understand different types of data and which plot is best suited to visualize the data.**
    - let's go through Quiz 5 together

- **A cheat sheet that summarizes the key for each graph and how to tune different features of the graph would be really helpful.**
    - check out the matplotlib cheat sheet linked in the last cell of the lecture notes
    - [Matplotlib cheatsheets](https://github.com/matplotlib/cheatsheets)

- **If I use %matplotlib inline after importing packages, I wonder whether plt.show() is still necessary**
    - Me too. :) Try it without `plt.show()`.

- **In what situations would a box plot be preferable to a violin plot (and vice versa)?**
    - both plot types are great for visualizing a categorical vs. a continuous feature
    - this is a subjective decision that depends on your preferences
    - personally I prefer violin plots because they show the distributions better
    - others prefer the box plot because it is easier to read quantiles such as the median and the quartiles off it

- **It would be helpful if you went through why certain graphs are better for continuous vs categorical data. I think this would help me remember which to use, with intuition, rather than refer back to the chart as a reference.**
    - create bad plots :)
    - try to create a scatter plot using two categorical features, you'll see why this is bad.
    - try to create a bar plot using a continuous feature, it won't look good because each value might be unique

- **In the visualizations, are you choosing a particular 'alpha' value, or testing different values to see what looks best?**
    - I always experiment a bit

- **Also I'm wondering how to show a histogram's axis with log bins.**
    - you need to create the log bins manually, for example with `np.logspace`, pass them to the histogram, and then call `plt.semilogx()` to put the x axis on a log scale
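
A minimal sketch of those steps, assuming made-up positive data and a hand-picked bin range from 10^-3 to 10^3. In a notebook the figure appears inline.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical positive, heavy-tailed data.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=1.5, size=1000)

# 30 bins whose edges are evenly spaced in log10 between 10**-3 and 10**3.
bins = np.logspace(-3, 3, 31)

plt.hist(values, bins=bins)
plt.semilogx()  # switch the x axis to a log scale so the bins look equally wide
plt.xlabel("value")
plt.ylabel("count")
```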

- **I'd also like some clarity on the nuts and bolts of the code that produces these 2D visualizations, particularly heatmaps.**
    - the best way to go about that is to check the manuals of the plotting functions and play around with the arguments I use.

- **In your video example, there was a divide by zero error related to the log10 function. I got the same error on my system. Is that particular error something that could alter the visualization?**
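    - It doesn't alter the visualization: cells with a count of zero have no finite log10 value and simply show up as white squares
    - If the warning bothers you, look for a way to avoid it, e.g., by masking the zero counts before taking the logarithm.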
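
One possible way to avoid the warning, sketched on a made-up count matrix: take the logarithm only where the count is positive and leave the other cells as NaN, which heatmap functions should still leave blank.

```python
import numpy as np

# Hypothetical 2D count matrix with some empty (zero-count) cells.
counts = np.array([[0, 5, 100],
                   [1, 0, 10]])

# Take log10 only where the count is positive; empty cells stay NaN,
# so log10 is never evaluated at zero.
log_counts = np.full(counts.shape, np.nan)
np.log10(counts, out=log_counts, where=counts > 0)

print(log_counts)
```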

- **The multiple columns plots didn't work for me so I'm still a little confused about how to change that code to suit what I want.**
    - I'm not sure what's going on here but this is an excellent exercise to gain experience in debugging.
    - Go ahead and investigate the issue and let me know when you have found a fix.

`pd.plotting.scatter_matrix(df.select_dtypes(int), figsize=(9, 9), c=pd.get_dummies(df['gross-income']).iloc[:,1], marker='o', hist_kwds={'bins': 50}, s=30, alpha=.1)`

- **When should I use get_dummies to plot?**
    - in the scatter matrix, `get_dummies` turns the categorical feature into 0/1 columns, one of which sets the colors of the points
    - use it if you want to prepare a plot using all continuous features and color the points by one categorical feature
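
A sketch of what that color column looks like, on a made-up slice of the adult data. The `dtype=int` argument is an addition here, since newer pandas versions return booleans from `get_dummies` by default.

```python
import pandas as pd

# Hypothetical slice of the adult data: two continuous features and one categorical.
df = pd.DataFrame({
    "age": [25, 47, 38, 52],
    "hours-per-week": [40, 60, 45, 50],
    "gross-income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# One indicator column per category, ordered alphabetically by category name.
dummies = pd.get_dummies(df["gross-income"], dtype=int)

# The second column marks the '>50K' rows with 1 and the others with 0;
# passing it as `c` gives each income group its own color.
colors = dummies.iloc[:, 1]
print(colors.tolist())
```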

- **For the purpose of this course is it enough that you are able to create the visualizations using external resources (like stackoverflow) or should you be able to code them without looking.**
    - you can use any external resource!
    - having said that, you will have a time limit during the exams so you need to be able to code sufficiently quickly