Repository size #1
Comments
I can update the audio links to point to the versions in the published feeds. Then I can remove the audio from radiotopia.fm. Would that help? |
Most of the audio is already coming from our CDN. The exceptions are the two podcasts that don't use our CDN, Love+Radio and The Truth. I can remove about 500MB of the audio (leaving about 70MB of audio). @farski Is that what you would like? |
Most of the audio in the site right now already seems to be coming right from show-specific CDN URLs, which is fine. It wouldn't even be bad if the files were all coming from our S3 like they were originally. The location of the files doesn't really matter, as long as they aren't getting checked into git. Deleting the files at this point wouldn't help the repo size since all the history would obviously still be there. The data would need to get removed from the git history, which isn't something I'm too familiar with. https://help.github.com/articles/remove-sensitive-data/ I would say if we're going to do the work to fix this there should be no audio left in the repo. It's fine to leave it on S3, but it should be managed outside of git. |
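Since deleting files from the working tree leaves every old copy in the history, it can help to measure how much of the history is actually audio before deciding how to prune it. The following is a minimal sketch, not something from this thread: it assumes `git` is on the PATH and uses the commonly cited `git rev-list --objects --all` piped into `git cat-file --batch-check`. The function names (`parse_batch_check`, `audio_bytes`, `list_history_blobs`) are hypothetical.

```python
import subprocess

# Format passed to `git cat-file --batch-check`; %(rest) carries the path
# that `git rev-list --objects` prints after each object id.
BATCH_FORMAT = "%(objecttype) %(objectname) %(objectsize) %(rest)"


def parse_batch_check(output):
    """Turn '<type> <sha> <size> <path>' lines into (size, path) pairs for blobs."""
    blobs = []
    for line in output.splitlines():
        fields = line.split(" ", 3)
        # Commits and trees are skipped; only file contents (blobs) count.
        if len(fields) == 4 and fields[0] == "blob":
            blobs.append((int(fields[2]), fields[3]))
    return blobs


def audio_bytes(blobs, suffix=".mp3"):
    """Sum the uncompressed size of every blob whose path ends with `suffix`."""
    return sum(size for size, path in blobs if path.lower().endswith(suffix))


def list_history_blobs(repo_dir="."):
    """Run git in `repo_dir` and return (size, path) for every blob in history."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    checked = subprocess.run(
        ["git", "cat-file", "--batch-check=" + BATCH_FORMAT],
        cwd=repo_dir, input=objects, capture_output=True, text=True, check=True,
    ).stdout
    return parse_batch_check(checked)


if __name__ == "__main__":
    blobs = list_history_blobs(".")
    print("mp3 bytes in history:", audio_bytes(blobs))
    # Show the ten largest blobs so the biggest offenders are easy to spot.
    for size, path in sorted(blobs, reverse=True)[:10]:
        print(size, path)
```

The sizes reported are uncompressed object sizes, so they will not match the on-disk size of the repo exactly, but mp3 files compress very little, so the total should be a reasonable guide.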
And why remove it from the repo? Do we pay for that storage? If so, how much per GB/year? |
It is more about not managing them in the repo at all. Git doesn't care if We can prune them from the repo but it would mean as well that we need a We do not really pay for storage (github has a soft cap) but I do know from This is all part of a larger conversation we need to have about where to Thoughts?
|
Mostly a matter of convenience, but also convention and the fact that GitHub really doesn't like it. As it is now, if someone wanted to check out the repo to fix a typo or something they'd need to download over a gig of data. By comparison, prx.org is a huge app and has seven years' worth of commits and is only ~600 MB. It's not really common for large media files to get checked in, and GitHub can start rejecting them: https://help.github.com/articles/working-with-large-files/ |
I've put "*.mp3" in my .gitignore file; but that will only stop the problem from recurring. Removing the files from the history is tricky. So far this is the best article I've found on that:
As you may have surmised, I only know the most basic git commands/conventions. I can put together a plan for removing the mp3 files from the history, but I would need someone to review it. |
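Before relying on the `*.mp3` ignore rule, it may be worth checking which files in the working tree the pattern would cover and how large they are. The sketch below is only an approximation of gitignore matching: it assumes a slash-free pattern that is compared against each file name at any depth, and the helper names (`matching_files`, `total_bytes`) are hypothetical. Note that an ignore rule only affects untracked files; anything already committed stays tracked and stays in the history.

```python
import fnmatch
import os


def matching_files(root, pattern="*.mp3"):
    """Return sorted (relative_path, size_in_bytes) for files matching `pattern`."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            # Case-sensitive match on the file name only.
            if fnmatch.fnmatchcase(name, pattern):
                full = os.path.join(dirpath, name)
                hits.append((os.path.relpath(full, root), os.path.getsize(full)))
    return sorted(hits)


def total_bytes(hits):
    """Add up the sizes reported by matching_files."""
    return sum(size for _path, size in hits)


if __name__ == "__main__":
    found = matching_files(".")
    for path, size in found:
        print(size, path)
    print("total bytes:", total_bytes(found))
```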
Fixing the size, either by pruning the files from the existing repo or just creating a new repo (I put the site in git originally because it's the easiest way we have to centrally own files, not because the history is particularly useful to us), should be pretty easy. As Chris mentioned the thing we need to figure out is how to manage these files going forward. If we can eliminate the need to host any just for the site and can switch them all to external sources that seems like an easy solution. If we know that at some point (now or in the future) there will be audio files that exist just for the site then...I don't know. We either could just have some very strong policy (copies on S3 and the office projects NAS) or investigate something else. It's hard to imagine these files ever being something we (PRX proper) are producing, so they really shouldn't ever be something we are solely responsible for anyway. |
@farski This may seem unrelated, but it is not. Can I change autoLoad to false in main.js? One reason I do not grab the audio from the exact URL in the feed is that I do not want the autoLoad for all the audio on the page to artificially inflate the stats. If I can turn off autoLoad then I have no concern that using the audio URLs from the feed will unduly inflate the stats. |
I don't have much say over that. They are autoloaded so people don't have to wait, which seems like a good user experience. It's also why all the files were originally hosted by us; we could guarantee performance without impacting the numbers. |
As far as managing audio files, they can either stay where they are in the radiotopia bucket, and the bucket and repo would just stay out of sync on purpose, or we could move the files to another bucket keeping the radiotopia bucket in sync with the repo, but meaning there are two places to think about for this site. I would lean towards the latter, but I really don't touch this property anymore, so I think you can make the call. As long as it's documented in this repo's Readme we should be fine. |
Agreed with @farski on S3 storage: there is probably less possibility of wiping out the files if they are under media, and perhaps an easier workflow if they are under the same radiotopia bucket. It's hard to say which is better, so I leave it to you @debenedictis since you are managing it (though I can make the choice if that helps). I really have no idea how much this would inflate metrics anyway, but at least for podtrac files, you can get the URL podtrac is redirecting to instead of the podtrac URL, and avoid inflation. Like @farski I err on the side of better user experience, and would prefer we host as little as possible, so to me that means autoload should be on, and when possible we should use files from URLs where they are already hosted (without podtrac redirects). |
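The idea of using the redirect target instead of the tracking URL could be sketched as below. This is illustrative only: it assumes the tracking URL embeds the target host and path after a prefix such as `/pts/redirect.mp3/` or `/redirect.mp3/`, and that the target is served over the same scheme as the redirect. Both are assumptions about Podtrac's URL layout, not something confirmed in this thread, and the function name `strip_podtrac` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Assumed path prefixes for redirect-style tracking URLs (not verified here).
ASSUMED_PREFIXES = ("/pts/redirect.mp3/", "/redirect.mp3/")


def strip_podtrac(url):
    """Return the URL a Podtrac-style redirect appears to wrap, else `url` unchanged."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host != "podtrac.com" and not host.endswith(".podtrac.com"):
        return url  # Not a tracking URL: leave it alone.
    for prefix in ASSUMED_PREFIXES:
        if parts.path.startswith(prefix):
            target = parts.path[len(prefix):]
            if not target:
                return url  # Nothing after the prefix: nothing to unwrap.
            # Assumption: the wrapped file uses the same scheme as the redirect.
            result = parts.scheme + "://" + target
            if parts.query:
                result += "?" + parts.query
            return result
    return url


if __name__ == "__main__":
    print(strip_podtrac(
        "http://www.podtrac.com/pts/redirect.mp3/cdn.example.org/show/ep1.mp3"
    ))
```

If the prefix assumption turns out to be wrong for a given feed, the function returns the input unchanged, so the worst case is that the tracked URL keeps being used.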
I think that for most cases we can just reference the audio with the URL from the feed. All of the requests I've had to update the audio on Radiotopia.fm have been to do so with specific episodes from existing feeds. I'll test turning off autoLoad to determine how much the user experience degrades. |
@kookster I can remove all the audio files from Radiotopia.fm and just use the feed URLs (without the Podtrac tracking). This is a problem for The Truth and Love+Radio since they use a stats system where they get credit even if I bypass Podtrac. I can work around that though. One reason to turn off autoLoad, though, is so that listens that do occur on Radiotopia.fm can be credited to the shows that are listened to. |
I realize the soundcloud stats will be affected; I think that is a fine |
@farski I am thinking of removing the mp3 files with the BFG Repo-Cleaner: Do you have any questions or concerns regarding that? |
I'm not familiar with it. As long as you have a backup of the current repo there's not really any risk, though. |
I know how to use filter-branch pretty well so if this doesn't work (though
|
@farski @chrisrhoden thank you |
@debenedictis I know that popup archive used that same tool for a similar purpose, so I believe it should work. |
I ran |
This repo has grown to over 1 GB, which is a little crazy considering the site itself is ~10 MB and hardly changes. I think it's worth figuring out a better way to handle the audio files. I had originally checked in some that were edited website-specific versions; that was probably a mistake to begin with. More files have been checked in over time, though, and at this point even if they are just for the website we should find somewhere else to manage them.
@chrisrhoden @debenedictis any thoughts?